PortraitCaptions

long-form source → ranked moments → captioned verticals → Drive, link-stable.

Overview

End-to-end: take a long-form recording (Twitch VOD, podcast session, livestream) and ship five ready-to-post 9:16 clips into a Google shared drive with caption quality that survives iteration. Every file in Drive has a stable webViewLink that never changes, no matter how many times you re-transcribe or re-render.

The pipeline is two project skills working in sequence: portrait-foreignfilm-clips (first pass, end-to-end) and caption-quality-boost (re-transcribe + replace in place). Both live in .claude/skills/.

Why this matters

Short-form vertical is where distribution happens — TikTok, Reels, Shorts, LinkedIn video. Most creators solve this one of two ways, and both hurt: (a) sit in a timeline editor for an hour per clip, or (b) trust an auto-crop tool that guillotines faces and drops the dock on top of captions. Neither scales.

This pipeline exists because the second you treat portrait as the default — not a post-hoc conversion from a landscape master — three things fall out: framing is decided once at extraction, captions are designed for the 9:16 frame instead of squeezed into it, and every quality pass is a cheap in-place re-render because the Drive links never move.

The net effect: a 60-minute session becomes five posted-quality verticals in about half an hour, and you can re-transcribe with a better model two weeks later without anyone's link ever going stale.

Kickoff — drop a link, walk away

The pipeline triggers on any longform link or file path — YouTube, Twitch VOD, local .mp4. Defaults are committed to project memory: don't ask about count, selection, orientation, or layout. Run the full flow; pause at one checkpoint only.

default          value
clip count       5
selection        LLM-rank top 5 from whisper transcript (auto-confirmed)
orientation      portrait 9:16 (1080×1920)
layout           A · face-top (alternates offered after completion)
face detection   dynamic per source — probe midpoint frame, cache face.json
captions         foreign-film yellow italic serif; color-coded per speaker if multi-speaker
destination      Google Drive shared drive Clips · {session}-portrait-ff/
checkpoint       one — alternate layouts after the A batch uploads
queueable. Drop multiple links in order and they'll process sequentially without re-asking anything. Each source gets its own project folder under Clips/.

Timings — fire and forget

The only stages that need you are source and select. Everything from extract onward runs in the background, in parallel across all clips, and uploads itself to Drive as each one finishes. You kick it off and walk away.

hands-on ≈ 5 min · hands-off ≈ 10 min. Your attention budget for a five-clip batch is roughly one coffee. The machine spends the rest of the time rendering while you're somewhere else.

Approximate wall-clock times for a typical run: one 60-minute source, five clips averaging 45 seconds each, on an M-series Mac. Stages 3–6 run concurrently across all clips with one renderer + watcher invocation.

#    stage                        mode            time       notes
1    Source acquire               you             0–5 min    Instant if local; 2–5 min for yt-dlp on a 1-hour video.
2    Select moments               you             2–20 min   2 min with timestamps; 5 min via LLM rank; 15–20 min by hand.
── fire & forget · stages 3–6 run in parallel ──────────────────
3    Extract portrait × 5         bg · parallel   ~45s       ffmpeg filter pass; 5 concurrent, I/O-bound.
4    Transcribe small × 5         bg · parallel   ~2 min     CPU-bound; 5 cores saturate, roughly real-time per clip.
4b   Transcribe medium × 5        bg · parallel   ~4 min     Better proper nouns. Still CPU-bound — model shared across workers.
5    Caption composite × 5        bg · parallel   ~2 min     moviepy + PIL per clip; libx264 encode is the floor.
6    Drive upload × 5             bg · streaming  ~1 min     Watcher fires per .done sentinel; 5 uploads overlap.
     First pass — hands-off wall                  ~6 min     With small whisper. Stages 3–6 end-to-end, parallel.
     First pass — total door-to-door              ~15 min    Including selection and source fetch.
     Iteration pass (medium + re-render + replace)  bg · parallel  ~7 min  No re-selection. Drive files update in place — links unchanged.
first pass vs iteration. Selection happens once per source. Every quality pass after that is transcribe → render → replace in parallel — cheap enough to do weekly as whisper improves or you find a caption phrasing you like better.

Pipeline

┌────────────────────────┐
│ 1 · SOURCE             │  long-form recording acquired          yt-dlp / local
└───────────┬────────────┘                                        screen rec
            ▼
┌────────────────────────┐
│ 2 · SELECT             │  rank moments, produce                 best-clips
└───────────┬────────────┘  {start, end, title}                   TwelveLabs
            ▼                                                     manual list
┌────────────────────────┐
│ 3 · EXTRACT (PORTRAIT) │  cut direct to 1080×1920 —             ffmpeg -ss -to
└───────────┬────────────┘  face top · screen below               single filter pass
            ▼
┌────────────────────────┐
│ 4 · TRANSCRIBE         │  whisper word-timestamps per clip      openai-whisper
└───────────┬────────────┘                                        small → medium
            ▼
┌────────────────────────┐
│ 5 · CAPTIONS (FF)      │  foreign-film yellow italic serif,     Georgia Bold Italic
└───────────┬────────────┘  pre-rasterized PNGs                   #F2D21B, 4px stroke
            ▼
┌────────────────────────┐
│ 6 · COMPOSITE          │  layer captions + cover bar            moviepy + PIL
└───────────┬────────────┘  onto portrait clip                    1080×1920 final
            ▼
┌────────────────────────┐
│ 7 · DRIVE              │  upload or update-in-place to          gws drive files
└────────────────────────┘  Clips shared drive                    {session}-portrait-ff

This guide is written in the order of the pipeline. If you already have landscape-captioned clips on disk, skip to #portrait. If you're starting from a raw recording, begin at #source.

default orientation: portrait. Every stage in this pipeline outputs 1080×1920. Landscape clips are only produced when explicitly requested.
stage 1

Source — acquire the long-form

Two common sources: a Twitch VOD / YouTube upload, or a locally-recorded podcast session. Both end in the same place: one large .mp4 on disk.

# youtube / twitch
yt-dlp -f "bv*[height<=1080]+ba/b[height<=1080]" \
       -o "source/%(title)s.%(ext)s" \
       "https://youtu.be/<id>"

# or just drop a local recording into source/
cp ~/Movies/Session-2026-04-16.mp4 source/

iCloud trap. Don't stage sources on iCloud Drive — the FUSE layer times out Chromium and Remotion. Stage to ~/local/source/ or /tmp. Moviepy itself is fine with iCloud.
stage 2

Select — pick the moments

Three ways, pick one.

a · transcript + LLM (fastest)

Transcribe the whole long-form once, then ask Claude to rank the top N moments against a rubric: hook strength, standalone coherence, quotability, visible screen activity. Output: a JSON list of {start, end, slug, title}.
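
The ranked output is easy to sanity-check before extraction. A minimal sketch, assuming the JSON shape above (timestamps as HH:MM:SS strings) and an illustrative 15–120 s short-form window — the window is an assumption, not a pipeline rule:

```python
import json

# hypothetical ranked-moments file — same {slug, start, end, title}
# shape as the manual CSV in option c
ranked = json.loads("""[
  {"slug": "experiment-until-you-beat-the-record",
   "start": "00:03:14", "end": "00:04:19",
   "title": "Experiment until you beat the record"}
]""")

def to_seconds(ts):
    """HH:MM:SS → seconds, for sanity-checking clip bounds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

for clip in ranked:
    dur = to_seconds(clip["end"]) - to_seconds(clip["start"])
    # 15–120 s is an assumed short-form window
    assert 15 <= dur <= 120, f"{clip['slug']}: {dur}s out of range"
```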

b · best-clips skill

Existing skill at .claude/skills/best-clips/. Scores long-form windows on visible coding activity + transcript energy. Good for stream recordings where the right-hand screen region is active.

c · manual list

For a curated show, write the timestamps by hand into a CSV.

# clip-list.csv
slug,start,end,title
experiment-until-you-beat-the-record,00:03:14,00:04:19,"Experiment until you beat the record"
i-wasnt-crazy-this-works,00:07:02,00:07:50,"I wasn't crazy — this works"
mcp-reliability-is-a-gamble,00:12:45,00:13:17,"MCP reliability is a gamble"
put-your-expertise-in-the-skills,00:21:30,00:22:19,"Put your expertise in the skills"
same-team-more-clients,00:33:12,00:34:00,"Same team, more clients"
stage 3

Extract — cut portrait clips direct from source

Portrait is the default — every clip goes straight to 1080×1920, never through a landscape intermediate. The layout transform (face crop + screen crop + cover bar) happens during extraction, not after.

mkdir -p clips-portrait
while IFS=, read -r slug start end title; do
  [[ "$slug" == "slug" ]] && continue
  ffmpeg -y -ss "$start" -to "$end" -i "source/session.mp4" \
    -filter_complex "
      [0:v]crop=260:260:1370:790,scale=1080:1080[face];
      [0:v]crop=1340:1080:0:0,scale=1080:-2,crop=1080:840[screen];
      [face][screen]vstack
    " \
    -c:v libx264 -preset medium -crf 20 \
    -c:a aac -b:a 128k \
    "clips-portrait/${slug}.mp4"
done < clip-list.csv

Output: five 1080×1920 clips, face on top, screen below. Face PiP is already baked into the source recording at roughly (1370, 790) with size 260×260; tune those numbers per session.

portrait is the default. Don't produce landscape intermediates and transform later. Landscape is the exception — only cut 16:9 when the user explicitly calls for it (YouTube long-form, desktop player, etc.). For everything else: 9:16 from the first frame.
stage 4

Transcribe — word-level timing

openai-whisper on CPU. small is ~real-time; medium is ~3× slower but meaningfully better on proper nouns and jargon — start with small on the first pass, upgrade in iteration.

python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py \
        landscape-masters/

# writes landscape-masters/_transcripts/{slug}.json — whisper raw, word_timestamps=true
stage 5

Burn-in v1 — the landscape master caption

Optional but conventional in this project. Many upstream clips already ship with a yellow all-caps caption burned into the landscape master. If yours do, call that folder *-captioned/ and continue. If not, you can skip straight to the portrait stage — the foreign-film caption layer in stage 7 stands on its own.

The cover bar in stage 6 is sized to fully obscure any existing burned-in captions at y ≈ 860–890 of the 1080-tall source. If your source has no prior captions, you can drop or shrink the bar.
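
A quick check that the 860–890 caption band really lands under the bar, using the stage-6 numbers (screen crop scaled 1340→1080, ~15 px trimmed by the center-crop, stacked at y = 1080) — a back-of-envelope sketch, not part of the skill:

```python
SCALE = 1080 / 1340          # screen crop scaled to portrait width
CROP_TOP = (870 - 840) // 2  # 15 px trimmed off the top by center-crop
STACK_Y = 1080               # screen region sits below the 1080² face

def portrait_y(source_y):
    """Map a y in the 1080-tall source to the final 1080×1920 frame."""
    return source_y * SCALE - CROP_TOP + STACK_Y

lo, hi = portrait_y(860), portrait_y(890)   # ≈ 1758 … 1782
assert 1700 <= lo and hi <= 1920            # inside the 1700–1920 bar
```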

stage 6

Portrait — face top, screen below

The layout transform. Two independent crops from the same 1662×1080 landscape master, vertically stacked into a 1080×1920 canvas.

   ◄──────────── 1662 ────────────►                    ◄── 1080 ──►
 ┌─────────────────────────────────┐                   ┌──────────┐ ─┐
 │                                 │   crop 260²       │          │  │
 │         (screen region)         │   @ (1370, 790)   │   face   │  │ 1080
 │                                 │   upscale 4× ──►  │  1080²   │  │
 │                     ┌─PiP──┐    │                   ├──────────┤ ─┘
 │                     │ 260² │    │   crop 1340×1080  │  screen  │ ─┐
 └─────────────────────┴──────┴────┘   scale to width  │ 1080×840 │  │ 840
                                       center-crop ──► ├──────────┤ ─┘
                                                       │██████████│ ─┐
                                                       │cover bar │  │ 220 (1700–1920)
                                                       └──────────┘ ─┘
layer         size        y range     source
face          1080×1080   0–1080      crop 260² @ (1370, 790), upscaled 4×
screen        1080×840    1080–1920   crop 1340×1080, scaled to fit width, center-cropped
cover bar     1080×220    1700–1920   solid black, opacity 1.0 — hides any v1 captions
caption PNG   ≤1080×180   ~1780       pre-rasterized per cue (stage 7)
face box is session-specific. Before batching, sample a midpoint frame and eyeball the PiP bounds. Podcast rigs drift the PiP position by ±40px between sessions.
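
A small helper can build the midpoint-probe command from the source duration. The helper and the `face-probe.png` output name are illustrative, not part of the skill:

```python
import shlex

def midpoint_probe_cmd(src, duration_s):
    """Build (not run) an ffmpeg command that grabs the midpoint frame
    so the PiP bounds can be eyeballed before batching."""
    mid = duration_s / 2
    return (f"ffmpeg -y -ss {mid:.1f} -i {shlex.quote(src)} "
            f"-frames:v 1 face-probe.png")

cmd = midpoint_probe_cmd("source/session.mp4", 3600)
# → ffmpeg -y -ss 1800.0 -i source/session.mp4 -frames:v 1 face-probe.png
```
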
stage 7

Captions — the foreign-film look

Homebrew's ffmpeg 8 ships without libass, the subtitles filter, or even drawtext. So SRT/ASS burn-in is off the table. Instead: pre-rasterise each cue to a transparent PNG with PIL, then composite with moviepy.

# group whisper words into screen-cue chunks
groups = []; cur = []
for w in words:
    cur.append(w)
    if len(cur) >= 5 or cur[-1].end - cur[0].start >= 2.5 \
       or len(" ".join(x.word for x in cur)) >= 34:
        groups.append(cur); cur = []
if cur:                       # flush the trailing partial group
    groups.append(cur)

Each group becomes a PNG drawn with Georgia Bold Italic 60pt, fill #F2D21B, 4px black stroke, centered, wrapped at ~22 chars. Positioned y = 1920 - img_h - 30.
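
The wrap-and-position step can be sketched without PIL. LINE_H is an assumed per-line pixel height at 60pt including stroke — the real image height comes from the rendered PNG:

```python
import textwrap

WRAP_CHARS = 22   # ~22-char wrap from the spec above
MARGIN = 30       # bottom margin: y = 1920 - img_h - 30
LINE_H = 72       # assumed px per wrapped line; real value is the PNG's

def layout_cue(text):
    """Wrap a cue and compute its paste-y on the 1080×1920 canvas."""
    lines = textwrap.wrap(text, WRAP_CHARS)
    img_h = LINE_H * len(lines)
    return lines, 1920 - img_h - MARGIN

lines, y = layout_cue("Experiment until you beat the record")
# → lines = ["Experiment until you", "beat the record"], y = 1746
```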

field    value                          why
font     Georgia Bold Italic            foreign-film default; reads warm and serious
size     60pt                           legible at thumb-scroll scale
colour   #F2D21B                        warm yellow, pops on dark screens
stroke   4px black                      survives bright backgrounds without a box
chunk    ≤5 words / ≤2.5s / ≤34 chars   TikTok-pace, readable before it moves
stage 8

Drive — upload & link-stable replace

gws drive files update --upload replaces the media content of an existing file. The fileId and every webViewLink you've already sent keep working. Never create during iteration.

# list the destination folder, build a name → id inventory
gws drive files list \
  --params '{"q":"\"<folder-id>\" in parents and trashed=false",
             "driveId":"0AI4JyAzsoqRJUk9PVA",
             "corpora":"drive",
             "includeItemsFromAllDrives":true,
             "supportsAllDrives":true,
             "pageSize":200,
             "fields":"files(id,name)"}'

# for each local mp4 — replace if match, create if new
cd "$OUT_DIR"
gws drive files update \
  --params "{\"fileId\":\"$ID\",\"supportsAllDrives\":true}" \
  --upload "$f" --upload-content-type video/mp4
gws path scoping. The CLI refuses --upload paths outside the current working directory. Always cd into the output folder and pass a basename.
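
The replace-or-create decision reduces to a lookup on that name → id inventory. A sketch of the logic (function name and shapes are illustrative; the real flow is the shell above):

```python
def plan_uploads(inventory, local_files):
    """inventory: name → fileId built from `gws drive files list`."""
    plan = []
    for name in local_files:
        if name in inventory:
            # existing fileId → update in place; webViewLink survives
            plan.append(("update", inventory[name], name))
        else:
            # new clip → create once; its link is stable afterwards
            plan.append(("create", None, name))
    return plan

plan = plan_uploads({"intro-ff.mp4": "1AbC"}, ["intro-ff.mp4", "new-ff.mp4"])
# → [("update", "1AbC", "intro-ff.mp4"), ("create", None, "new-ff.mp4")]
```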

Run it

Two-skill orchestration. Skill 1 does the first render and uploads. Skill 2 upgrades and replaces.

First pass

SESSION=measure-summit-2026
SRC="public/clips/portrait/podcast-clips/${SESSION}-captioned"
OUT="public/clips/portrait/podcast-clips/${SESSION}-portrait-ff"

python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py          "$SRC"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"
bash    .claude/skills/portrait-foreignfilm-clips/scripts/upload_clips.sh        "$OUT" "${SESSION}-portrait-ff"

Iterate — better model, same links

python3 .claude/skills/caption-quality-boost/scripts/retranscribe.py             "$SRC" --model medium
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"
bash    .claude/skills/caption-quality-boost/scripts/drive_replace.sh            "$OUT" "${SESSION}-portrait-ff"

Live replace (parallel)

Render in one shell; the watcher in another uploads each clip the instant its .done sentinel appears.

# shell 1 — renderer touches {out}.done after each clip
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"

# shell 2 — watcher polls, replaces in place, writes .replaced
/bin/bash .claude/skills/caption-quality-boost/scripts/drive_replace_watch.sh    "$OUT" "${SESSION}-portrait-ff"

Quality levers

lever                 from → to                   lift                                        cost
whisper model         small → medium              big win on proper nouns, jargon             ~3× CPU time
LLM polish pass       raw → Claude-cleaned        punctuation, split run-ons, fix names       ~1 API call / clip
chunk length          5w / 2.5s → 3w / 1.5s       tighter pacing, more beats                  config only
caption size          60pt → 72pt                 easier at thumb scale                       config only
face crop tightness   260² → 220²                 closer face, more emotion                   per-session tune
moment selection      manual → best-clips skill   finds higher-energy hooks you'd skim past   ~1 min / hr of source

Layouts — A default, alternates on demand

A · face-top is the committed default. Face cropped from the source PiP, upscaled into the top 1080px. Screen cropped to the left ~1340 columns, scaled to 1080×840 in the bottom half. Cover bar 1080×220 at y=1700 hides platform UI, dock, and any baked-in captions.

After the A batch uploads, the runner asks once whether to render alternates. Pick any subset:

layout   bottom framing                                                      use when
A        screen cropped + center-fit                                         default. Face reads clearly; screen compressed.
B        screen center-cropped to 9:16 directly                              pure screen-content clips where face isn't useful.
C        full screen fit + blurred duplicate background                      source has no clean face PiP, or screen content is cropped too aggressively in A.
D        face-top unchanged · bottom = full-screen card on blurred duplicate artistic variant — full screen visible, no content lost, more depth.

Alternates land in {session}-portrait-ff/alternates/layout-{b,c,d}/. Share links on the A originals stay stable.

Speaker-aware captions

Multi-speaker clips get color-coded captions per speaker, optionally with a matching font. Built on pyannote diarization over the source audio.

accuracy is mandatory. LLM-based speaker-identity bootstrap is unreliable — it guesses identity from text content and gets it wrong often enough that blind color mapping is worse than no color at all. Every multi-speaker source goes through a user-verification checkpoint before burn-in.

Flow for a multi-speaker source:

  1. Diarize the full source audio — output {speaker_0, speaker_1, …} timestamped turns.
  2. Speaker manifest — per detected speaker, extract a 3–5s clearest-audio sample + a video frame from that moment. Written to {source}.speakers/{id}.{wav,jpg}.
  3. Verification checkpoint — present each speaker's sample to the user. User maps speaker_id → person_name.
  4. Name → colour/font map — project-level registry. Known speakers inherit committed colours.
  5. Render — per caption cue, the dominant speaker drives colour and font. Cues that span a speaker switch split at the boundary.
  6. Per-clip override — if a rendered clip misattributes, an overrides.json entry re-renders that clip without redoing diarization.
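
Step 5's boundary split is a pure interval intersection. A minimal sketch, assuming sorted, non-overlapping diarization turns:

```python
def split_cue(cue_start, cue_end, turns):
    """turns: [(start, end, speaker_id), …] sorted, non-overlapping.
    Returns the cue cut into per-speaker pieces."""
    pieces = []
    for t_start, t_end, spk in turns:
        lo, hi = max(cue_start, t_start), min(cue_end, t_end)
        if hi > lo:                       # cue overlaps this turn
            pieces.append((lo, hi, spk))
    return pieces

# a 10.0–13.0 s cue spanning a speaker switch at 11.5 s:
split_cue(10.0, 13.0, [(0.0, 11.5, "speaker_0"), (11.5, 60.0, "speaker_1")])
# → [(10.0, 11.5, "speaker_0"), (11.5, 13.0, "speaker_1")]
```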

Known speaker registry (HICAM):

name       colour   hex
jordaaan   yellow   #f2d21b
colin      orange   #ff8c00
steven     green    #00e676

Verified mappings cache to {session}.speakers.json so re-renders skip the checkpoint. Single-speaker sources bypass diarization entirely and render with the default foreign-film yellow.

Multi-speaker podcast processing

HICAM-style podcasts ship as multi-ISO recordings — one dedicated microphone track per speaker plus multiple camera angles. Each ISO has its own signal characteristics (mic placement, gain staging, room noise). Quality is not constant across ISOs, and it differs speaker-to-speaker because each person has a different mic on their voice.

Before anything renders, the pipeline scores each ISO and picks the best source per speaker. Wrong ISO choice = muddy captions, mis-transcriptions, wrong color attribution. Audio quality is gating.

Audio — ISO quality step

scripts/hicam-iso-quality.py ingests all staged ISO WAVs for a session and grades each one on ffmpeg-measured signal:

metric             from            what it tells you
mean_dbfs          volumedetect    Overall loudness. Below -60 dB = effectively silent.
peak_dbfs          volumedetect    Headroom. Values near 0 dB suggest clipping.
silence_ratio      silencedetect   Fraction of duration below -40 dB. >0.95 = a dead track.
noise_floor_dbfs   astats          Residual room/hiss. Combined with peak gives dynamic range.
dynamic_range_db   astats          Wider = more speech dynamics; narrower = compressed/room-only.
grade              rubric          A / B / C / F — rolls the above into a usable flag.

# check every ISO in a HICAM session
python3 scripts/hicam-iso-quality.py \
  --session public/clips/hicam/260316/hicam-session.json \
  --json public/clips/hicam/260316/iso-quality.json

# or ad-hoc across specific WAVs
python3 scripts/hicam-iso-quality.py mic1.wav mic2.wav cam1-audio.wav

Sample output from session 260316:

       name        grade        dur_s      mean_dB      peak_dB      silence
cam1-audio             B       7529.0        -30.2         -4.2        0.038
program-aud            F        900.0        -91.0        -91.0        1.000

The existing HICAM notes confirmed what the grader caught automatically: program-audio is silent at -91 dB; cam1-audio (camera room mic) is the only usable local track. Catch this in seconds, not after a failed transcription run.
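
The roll-up into a grade might look like the sketch below. The thresholds beyond the documented -60 dB and 0.95 cutoffs are assumptions, as is cam1-audio's dynamic range; the real rubric lives in scripts/hicam-iso-quality.py:

```python
def grade_iso(mean_dbfs, peak_dbfs, silence_ratio, dynamic_range_db):
    """Illustrative A/B/C/F rubric over the ffmpeg-measured metrics."""
    if mean_dbfs < -60 or silence_ratio > 0.95:
        return "F"        # effectively silent / dead track
    if peak_dbfs > -1:
        return "C"        # near 0 dB — likely clipping
    if dynamic_range_db < 20:
        return "C"        # compressed / room-only
    if mean_dbfs > -28 and silence_ratio < 0.2:
        return "A"
    return "B"

grade_iso(-30.2, -4.2, 0.038, 26.0)   # cam1-audio metrics → "B"
grade_iso(-91.0, -91.0, 1.000, 0.0)   # program-audio      → "F"
```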

Per-speaker ISO routing

On well-recorded sessions, each speaker has their own lavalier or hand mic. That mic is the correct ISO for that speaker — not the program mix, not the room mic. The pipeline builds a speaker_id → iso_path map after grading:

  1. Grade every ISO (step above).
  2. Drop tracks graded F (silent / clipped).
  3. Diarize the program mix to get speaker_id turns.
  4. For each diarized speaker, correlate which ISO has the highest speech energy during that speaker's turns — that ISO is their mic.
  5. Cache {session}.speakers.json including iso_path per speaker.
  6. At render time, extract each speaker's cue audio from their own ISO (never the program mix).
why this matters for captions. Transcribing speaker A from speaker A's own lavalier gives word-level timing and accuracy the program mix can't match — because the program mix has bleed, background, and compression that whisper hates. Per-speaker ISO transcription fixes misattribution and improves word-boundary timing directly.
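
Step 4's correlation can be sketched with per-second energy profiles. This is an assumed simplification of the script's method, with illustrative ISO names:

```python
def route_isos(turns, iso_energy):
    """turns: [(start_s, end_s, speaker_id)]; iso_energy: iso → list of
    per-second RMS values. Picks the ISO with the most energy during
    each speaker's turns."""
    totals = {}
    for start, end, spk in turns:
        per_iso = totals.setdefault(spk, {})
        for iso, energy in iso_energy.items():
            per_iso[iso] = per_iso.get(iso, 0.0) + sum(energy[int(start):int(end)])
    return {spk: max(isos, key=isos.get) for spk, isos in totals.items()}

turns = [(0, 4, "speaker_0"), (4, 8, "speaker_1")]
iso_energy = {
    "mic1.wav": [0.9, 0.8, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1],  # hot 0–4 s
    "mic2.wav": [0.1, 0.1, 0.1, 0.1, 0.8, 0.9, 0.9, 0.8],  # hot 4–8 s
}
route_isos(turns, iso_energy)
# → {"speaker_0": "mic1.wav", "speaker_1": "mic2.wav"}
```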

Visual — active-speaker crop

Parallel to ISO routing, layout A's top region follows whoever is speaking right now, cropping the face from the frame region tied to the active speaker — the x_range bounds confirmed at the verification checkpoint.

Combined verification checkpoint

Speaker identity drives three things at once: caption colour, caption font (optional), and top-crop source. The user verifies the mapping in one checkpoint, not three:

speaker_0  →  iso: mic1.wav  (grade A, mean -22 dB)
              x_range: 120–420  (composite frame)
              → name?   [jordaaan]

speaker_1  →  iso: mic2.wav  (grade A, mean -25 dB)
              x_range: 420–720
              → name?   [colin]

speaker_2  →  iso: cam3-audio.wav  (grade B, mean -31 dB)
              x_range: 720–1020
              → name?   [steven]

Once confirmed, colour/font/crop are locked. The cache in {session}.speakers.json skips this on re-renders.

Known pitfalls on multi-ISO sources

Queueing multiple sources

Drop multiple longform links in any turn. Each gets its own project folder under Clips/ and runs through the full kickoff flow independently. The runner chains them so downloads + transcriptions overlap where possible — next source's download starts while the current source is rendering.

# typical multi-source turn
make clips for these:
  https://youtu.be/abc123           # source 1
  https://twitch.tv/videos/456789  # source 2
  /Users/me/local-recording.mp4     # source 3

# result: three project folders in Clips/
#   source-1-slug-portrait-ff/
#   source-2-slug-portrait-ff/
#   source-3-slug-portrait-ff/
# each with 5 clips; only one post-completion alternate-layout prompt per source

Pitfalls

Appendix — file layout

project/
├── source/                                         # long-form recordings (gitignored)
├── landscape-masters/                              # extracted 1662×1080 cuts
│   └── _transcripts/{slug}.json                    # whisper output
├── public/clips/portrait/podcast-clips/
│   ├── {session}-captioned/{slug}-captioned.mp4    # landscape + v1 caption
│   └── {session}-portrait-ff/{slug}-ff.mp4         # final 1080×1920
├── .claude/skills/
│   ├── portrait-foreignfilm-clips/                 # first render + upload
│   └── caption-quality-boost/                      # re-transcribe + replace
└── DOCUMENTATION/PORTRAIT-CAPTIONS-GUIDE.html      # this doc