End-to-end: take a long-form recording (Twitch VOD, podcast session, livestream) and ship five ready-to-post 9:16 clips into a Google shared drive with caption quality that survives iteration. Every file in Drive has a stable webViewLink that never changes, no matter how many times you re-transcribe or re-render.
The pipeline is two project skills working in sequence: portrait-foreignfilm-clips (first pass, end-to-end) and caption-quality-boost (re-transcribe + replace in place). Both live in .claude/skills/.
Short-form vertical is where distribution happens — TikTok, Reels, Shorts, LinkedIn video. Most creators solve this one of two ways, and both hurt: (a) sit in a timeline editor for an hour per clip, or (b) trust an auto-crop tool that guillotines faces and drops the dock on top of captions. Neither scales.
This pipeline exists because the second you treat portrait as the default — not a post-hoc conversion from a landscape master — iteration gets cheap: every re-render PATCHes the existing Drive file. When a sponsor, editor, or collaborator has a link, that link keeps working — and the quality keeps going up on the other end of it.

The net effect: a 60-minute session becomes five posted-quality verticals in about half an hour, and you can re-transcribe with a better model two weeks later without anyone's link ever going stale.
The pipeline triggers on any longform link or file path — YouTube, Twitch VOD, local .mp4. Defaults are committed to project memory: don't ask about count, selection, orientation, or layout. Run the full flow; pause at one checkpoint only.
| default | value |
|---|---|
| clip count | 5 |
| selection | LLM-rank top 5 from whisper transcript (auto-confirmed) |
| orientation | portrait 9:16 (1080×1920) |
| layout | A · face-top (alternates offered after completion) |
| face detection | dynamic per source — probe midpoint frame, cache face.json |
| captions | foreign-film yellow italic serif; color-coded per speaker if multi-speaker |
| destination | Google Drive shared drive Clips → {session}-portrait-ff/ |
| checkpoint | one — alternate layouts after the A batch uploads |
All output folders live under the shared drive root Clips/.
The only stages that need you are source and select. Everything from extract onward runs in the background, in parallel across all clips, and uploads itself to Drive as each one finishes. You kick it off and walk away.
Approximate wall-clock times for a typical run: one 60-minute source, five clips averaging 45 seconds each, on an M-series Mac. Stages 3–6 run concurrently across all clips with one renderer + watcher invocation.
| # | stage | mode | time | notes |
|---|---|---|---|---|
| 1 | Source acquire | you | 0–5 min | Instant if local; 2–5 min for yt-dlp on a 1-hour video. |
| 2 | Select moments | you | 2–20 min | 2 min with timestamps; 5 min via LLM rank; 15–20 min by hand. |
| ── fire & forget · stages 3–6 run in parallel ── | | | | |
| 3 | Extract portrait × 5 | bg · parallel | ~45s | ffmpeg filter pass; 5 concurrent, I/O-bound. |
| 4 | Transcribe small × 5 | bg · parallel | ~2 min | CPU-bound; 5 cores saturate, roughly real-time per clip. |
| 4b | Transcribe medium × 5 | bg · parallel | ~4 min | Better proper nouns. Still CPU-bound — model shared across workers. |
| 5 | Caption composite × 5 | bg · parallel | ~2 min | moviepy + PIL per clip; libx264 encode is the floor. |
| 6 | Drive upload × 5 | bg · streaming | ~1 min | Watcher fires per .done sentinel; 5 uploads overlap. |
| — | First pass — hands-off wall | — | ~6 min | With small whisper. Stages 3–6 end-to-end, parallel. |
| — | First pass — total door-to-door | — | ~15 min | Including selection and source fetch. |
| — | Iteration pass (medium + re-render + replace) | bg · parallel | ~7 min | No re-selection. Drive files update in place — links unchanged. |
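The fire-and-forget fan-out above can be sketched as a per-clip worker run through a thread pool. This is an illustration, not the project's actual runner — the stand-in `process_clip` body and the `.done` sentinel convention are assumptions drawn from the stage table.

```python
import concurrent.futures as cf
from pathlib import Path

def process_clip(slug: str, out_dir: Path) -> Path:
    """Hypothetical per-clip worker: extract → transcribe → caption,
    then touch the .done sentinel the upload watcher polls for."""
    out = out_dir / f"{slug}.mp4"
    out.write_bytes(b"")               # stand-in for the rendered clip
    Path(str(out) + ".done").touch()   # sentinel: ready for upload
    return out

def run_batch(slugs, out_dir: Path, workers: int = 5) -> list:
    """Fan all clips out across `workers` threads (stages are I/O-bound)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: process_clip(s, out_dir), slugs))
```

Threads (not processes) fit here because the real work lives in ffmpeg and whisper subprocesses, so the GIL is not the bottleneck.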
This guide is written in the order of the pipeline. If you already have landscape-captioned clips on disk, skip to #portrait. If you're starting from a raw recording, begin at #source.
Two common sources: a Twitch VOD / YouTube upload, or a locally-recorded podcast session. Both end in the same place: one large .mp4 on disk.
# youtube / twitch
yt-dlp -f "bv*[height<=1080]+ba/b[height<=1080]" \
-o "source/%(title)s.%(ext)s" \
"https://youtu.be/<id>"
# or just drop a local recording into source/
cp ~/Movies/Session-2026-04-16.mp4 source/
Keep source recordings on local disk — ~/local/source/ or /tmp. Moviepy itself is fine with iCloud.
Three ways, pick one.
Transcribe the whole long-form once, then ask Claude to rank the top N moments against a rubric: hook strength, standalone coherence, quotability, visible screen activity. Output: a JSON list of {start, end, slug, title}.
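The ranked output plugs straight into the CSV format used below. A small sketch, assuming only the `{start, end, slug, title}` field names stated above — the function name and file layout are illustrative.

```python
import csv
import json
from pathlib import Path

def moments_to_csv(moments_json: str, csv_path: str, top_n: int = 5) -> int:
    """Write the top-N ranked moments into the clip-list CSV
    (slug,start,end,title) consumed by the extraction loop."""
    moments = json.loads(Path(moments_json).read_text())[:top_n]
    with open(csv_path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["slug", "start", "end", "title"])
        for m in moments:
            w.writerow([m["slug"], m["start"], m["end"], m["title"]])
    return len(moments)
```

`csv.writer` handles quoting titles that contain commas, which matters because the shell loop below splits naively on commas.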
Existing skill at .claude/skills/best-clips/. Scores long-form windows on visible coding activity + transcript energy. Good for stream recordings where visible screen activity marks the strong moments.
For a curated show, write the timestamps by hand into a CSV.
# clip-list.csv
slug,start,end,title
experiment-until-you-beat-the-record,00:03:14,00:04:19,"Experiment until you beat the record"
i-wasnt-crazy-this-works,00:07:02,00:07:50,"I wasn't crazy — this works"
mcp-reliability-is-a-gamble,00:12:45,00:13:17,"MCP reliability is a gamble"
put-your-expertise-in-the-skills,00:21:30,00:22:19,"Put your expertise in the skills"
same-team-more-clients,00:33:12,00:34:00,"Same team, more clients"
Portrait is the default — every clip goes straight to 1080×1920, never through a landscape intermediate. The layout transform (face crop + screen crop + cover bar) happens during extraction, not after.
mkdir -p clips-portrait
while IFS=, read -r slug start end title; do
[[ "$slug" == "slug" ]] && continue
ffmpeg -y -ss "$start" -to "$end" -i "source/session.mp4" \
-filter_complex "
[0:v]crop=260:260:1370:790,scale=1080:1080[face];
[0:v]crop=1340:1080:0:0,scale=1080:-2,crop=1080:840[screen];
[face][screen]vstack
" \
-c:v libx264 -preset medium -crf 20 \
-c:a aac -b:a 128k \
"clips-portrait/${slug}.mp4"
done < clip-list.csv
Output: five 1080×1920 clips, face on top, screen below. Face PiP is already baked into the source recording at roughly (1370, 790) with size 260×260; tune those numbers per session.
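Since the PiP coordinates are tuned per session, it can help to build the filtergraph from parameters instead of editing the string by hand. A hypothetical helper — the geometry defaults are the numbers from the loop above; nothing here is the project's actual script.

```python
def portrait_filter(face_x: int = 1370, face_y: int = 790, face_size: int = 260,
                    screen_w: int = 1340, src_h: int = 1080,
                    canvas_w: int = 1080, face_h: int = 1080,
                    screen_h: int = 840) -> str:
    """Build the ffmpeg filter_complex for layout A (face-top).
    The two stacked layers must sum to the 1920px canvas height."""
    assert face_h + screen_h == 1920
    face = (f"[0:v]crop={face_size}:{face_size}:{face_x}:{face_y},"
            f"scale={canvas_w}:{face_h}[face]")
    screen = (f"[0:v]crop={screen_w}:{src_h}:0:0,"
              f"scale={canvas_w}:-2,crop={canvas_w}:{screen_h}[screen]")
    return ";".join([face, screen, "[face][screen]vstack"])
```

Per-session tuning then becomes `portrait_filter(face_x=1350, face_size=280)` rather than a hand-edited filter string.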
openai-whisper on CPU. small is ~real-time; medium is ~3× slower but meaningfully better on proper nouns and jargon — start with small on the first pass, upgrade in iteration.
python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py \
landscape-masters/
# writes landscape-masters/_transcripts/{slug}.json — whisper raw, word_timestamps=true
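With `word_timestamps=True`, whisper's raw result nests words inside segments. A sketch that flattens a transcript JSON into the single word stream the caption chunker consumes — the `segments[].words[].{word,start,end}` shape is standard openai-whisper output; the function itself is illustrative.

```python
import json
from pathlib import Path

def load_words(transcript_path: str) -> list:
    """Flatten whisper segments into one [{word, start, end}, …] stream."""
    data = json.loads(Path(transcript_path).read_text())
    return [
        {"word": w["word"].strip(), "start": w["start"], "end": w["end"]}
        for seg in data["segments"]
        for w in seg.get("words", [])
    ]
```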
Optional but conventional in this project. Many upstream clips already ship with a yellow all-caps caption burned into the landscape master. If yours do, call that folder *-captioned/ and continue. If not, you can skip straight to the portrait stage — the foreignfilm caption layer in stage 7 stands on its own.
The cover bar in stage 6 is sized to fully obscure any existing burned-in captions at y ≈ 860–890 of the 1080-tall source. If your source has no prior captions, you can drop or shrink the bar.
The layout transform. Two independent crops from the same 1662×1080 landscape master, vertically stacked into a 1080×1920 canvas.
| layer | size | y range | source |
|---|---|---|---|
| face | 1080×1080 | 0–1080 | crop 260² @ (1370, 790), upscaled 4× |
| screen | 1080×840 | 1080–1920 | crop 1340×1080, scaled to fit width, center-cropped |
| cover bar | 1080×220 | 1700–1920 | solid black, opacity 1.0 — hides any v1 captions |
| caption PNG | ≤1080×180 | ~1780 | pre-rasterized per cue (stage 7) |
Homebrew's ffmpeg 8 ships without libass, the subtitles filter, or even drawtext. So SRT/ASS burn-in is off the table. Instead: pre-rasterise each cue to a transparent PNG with PIL, then composite with moviepy.
# group whisper words into screen-cue chunks
groups = []; cur = []
for w in words:
    cur.append(w)
    if len(cur) >= 5 or cur[-1].end - cur[0].start >= 2.5 \
            or len(" ".join(x.word for x in cur)) >= 34:
        groups.append(cur); cur = []
if cur:
    groups.append(cur)  # flush the trailing partial group
Each group becomes a PNG drawn with Georgia Bold Italic 60pt, fill #F2D21B, 4px black stroke, centered, wrapped at ~22 chars. Positioned y = 1920 - img_h - 30.
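The rasterizer itself is a short PIL routine. A minimal sketch of the approach, not the project's script: the macOS font filename is an assumption, and the fallback to PIL's default bitmap font (with stroking disabled, since bitmap fonts can't be stroked) exists only so the sketch runs anywhere.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_cue(text: str, width: int = 1080, font_path=None) -> Image.Image:
    """Rasterize one caption cue: yellow serif, black stroke, centered,
    wrapped at ~22 chars, on a transparent RGBA canvas."""
    stroke = 4
    try:
        # "Georgia Bold Italic.ttf" is the assumed macOS font file name
        font = ImageFont.truetype(font_path or "Georgia Bold Italic.ttf", 60)
    except OSError:
        font = ImageFont.load_default()
        stroke = 0  # bitmap fallback font cannot be stroked
    lines = textwrap.wrap(text, width=22)
    img = Image.new("RGBA", (width, 20 + 80 * max(len(lines), 1)), (0, 0, 0, 0))
    d = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        w = d.textlength(line, font=font)
        d.text(((width - w) / 2, 10 + 80 * i), line, font=font,
               fill="#F2D21B", stroke_width=stroke, stroke_fill="black")
    return img
```

The returned image composites onto the video at y = 1920 − img.height − 30, per the positioning rule above.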
| field | value | why |
|---|---|---|
| font | Georgia Bold Italic | foreign-film default; reads warm and serious |
| size | 60pt | legible at thumb scroll scale |
| colour | #F2D21B | warm yellow, pops on dark screens |
| stroke | 4px black | survives bright backgrounds without a box |
| chunk | ≤5 words / ≤2.5s / ≤34 chars | TikTok-pace, readable before it moves |
gws drive files update --upload replaces the media content of an existing file. The fileId and every webViewLink you've already sent keep working. Never create during iteration.
# list the destination folder, build a name → id inventory
gws drive files list \
--params '{"q":"\"<folder-id>\" in parents and trashed=false",
"driveId":"0AI4JyAzsoqRJUk9PVA",
"corpora":"drive",
"includeItemsFromAllDrives":true,
"supportsAllDrives":true,
"pageSize":200,
"fields":"files(id,name)"}'
# for each local mp4 — replace if match, create if new
cd "$OUT_DIR"
gws drive files update \
--params "{\"fileId\":\"$ID\",\"supportsAllDrives\":true}" \
--upload "$f" --upload-content-type video/mp4
gws won't accept --upload paths outside the current working directory. Always cd into the output folder and pass a basename.
Two-skill orchestration. Skill 1 does the first render and uploads. Skill 2 upgrades and replaces.
SESSION=measure-summit-2026
SRC="public/clips/portrait/podcast-clips/${SESSION}-captioned"
OUT="public/clips/portrait/podcast-clips/${SESSION}-portrait-ff"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py "$SRC"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py "$SRC" "$OUT"
bash .claude/skills/portrait-foreignfilm-clips/scripts/upload_clips.sh "$OUT" "${SESSION}-portrait-ff"
python3 .claude/skills/caption-quality-boost/scripts/retranscribe.py "$SRC" --model medium
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py "$SRC" "$OUT"
bash .claude/skills/caption-quality-boost/scripts/drive_replace.sh "$OUT" "${SESSION}-portrait-ff"
Render in one shell; the watcher in another uploads each clip the instant its .done sentinel appears.
# shell 1 — renderer touches {out}.done after each clip
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py "$SRC" "$OUT"
# shell 2 — watcher polls, replaces in place, writes .replaced
/bin/bash .claude/skills/caption-quality-boost/scripts/drive_replace_watch.sh "$OUT" "${SESSION}-portrait-ff"
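The watcher pattern is a plain poll loop over sentinels. A sketch with a pluggable `upload` callable so the Drive specifics stay out — the function name, `expected` count, and `.replaced` marker are assumptions modeled on the shell watcher's described behavior.

```python
import time
from pathlib import Path

def watch_and_replace(out_dir: Path, upload, expected: int,
                      poll_s: float = 0.1, timeout_s: float = 600.0) -> int:
    """Poll out_dir for *.done sentinels; upload each finished clip once,
    then mark it .replaced so later polls skip it."""
    deadline = time.monotonic() + timeout_s
    seen = set()
    while len(seen) < expected and time.monotonic() < deadline:
        for done in sorted(out_dir.glob("*.done")):
            clip = out_dir / done.name[: -len(".done")]
            if clip in seen or not clip.exists():
                continue
            upload(clip)  # e.g. shell out to gws drive files update
            (out_dir / (clip.name + ".replaced")).touch()
            seen.add(clip)
        if len(seen) < expected:
            time.sleep(poll_s)
    return len(seen)
```

Polling beats filesystem-event APIs here: it is deterministic, portable, and the renderer already announces completion via the sentinel file.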
| lever | from → to | lift | cost |
|---|---|---|---|
| whisper model | small → medium | big win on proper nouns, jargon | ~3× CPU time |
| LLM polish pass | raw → Claude-cleaned | punctuation, split run-ons, fix names | ~1 API call / clip |
| chunk length | 5w / 2.5s → 3w / 1.5s | tighter pacing, more beats | config only |
| caption size | 60pt → 72pt | easier at thumb scale | config only |
| face crop tightness | 260² → 220² | closer face, more emotion | per-session tune |
| moment selection | manual → best-clips skill | finds higher-energy hooks you'd skim past | ~1 min / hr of source |
A · face-top is the committed default. Face cropped from the source PiP, upscaled into the top 1080px. Screen cropped to the left ~1320 columns, scaled to 1080×840 in the bottom half. Cover bar 1080×220 at y=1700 hides platform UI, dock, and any baked-in captions.
After the A batch uploads, the runner asks once whether to render alternates. Pick any subset:
| layout | bottom framing | use when |
|---|---|---|
| A | screen cropped + center-fit | default. Face reads clearly; screen compressed. |
| B | screen center-cropped to 9:16 directly | pure screen-content clips where face isn't useful. |
| C | full screen fit + blurred duplicate background | source has no clean face PiP, or screen content is cropped too aggressively in A. |
| D | face-top unchanged · bottom = full-screen floating card on blurred duplicate | artistic variant — full screen visible, no content lost, more depth. |
Alternates land in {session}-portrait-ff/alternates/layout-{b,c,d}/. Share links on the A originals stay stable.
Multi-speaker clips get color-coded captions per speaker, optionally with a matching font. Built on pyannote diarization over the source audio.
Flow for a multi-speaker source:
1. Diarize the source audio into {speaker_0, speaker_1, …} timestamped turns.
2. Export a reference sample per speaker to {source}.speakers/{id}.{wav,jpg}.
3. Checkpoint: map each speaker_id → person_name.
4. An overrides.json entry re-renders that clip without redoing diarization.

Known speaker registry (HICAM):
| name | colour | hex |
|---|---|---|
| jordaaan | yellow | #f2d21b |
| colin | orange | #ff8c00 |
| steven | green | #00e676 |
Verified mappings cache to {session}.speakers.json so re-renders skip the checkpoint. Single-speaker sources bypass diarization entirely and render with the default foreign-film yellow.
HICAM-style podcasts ship as multi-ISO recordings — one dedicated microphone track per speaker plus multiple camera angles. Each ISO has its own signal characteristics (mic placement, gain staging, room noise). Quality is not constant across ISOs, and it differs speaker-to-speaker because each person has a different mic on their voice.
Before anything renders, the pipeline scores each ISO and picks the best source per speaker. Wrong ISO choice = muddy captions, mis-transcriptions, wrong color attribution. Audio quality is gating.
scripts/hicam-iso-quality.py ingests all staged ISO WAVs for a session and grades each one on ffmpeg-measured signal:
| metric | from | what it tells you |
|---|---|---|
| mean_dbfs | volumedetect | Overall loudness. Below -60 dB = effectively silent. |
| peak_dbfs | volumedetect | Headroom. Values near 0 dB suggest clipping. |
| silence_ratio | silencedetect | Fraction of duration below -40 dB. >0.95 = a dead track. |
| noise_floor_dbfs | astats | Residual room/hiss. Combined with peak gives dynamic range. |
| dynamic_range_db | astats | Wider = more speech dynamics; narrower = compressed/room-only. |
| grade | rubric | A / B / C / F — rolls the above into a usable flag. |
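The exact rubric lives in the script; the sketch below is a plausible reconstruction consistent with the sample grades shown later (cam1-audio → B, program-aud → F). All thresholds are illustrative assumptions, not the script's actual numbers.

```python
def grade_iso(mean_dbfs: float, peak_dbfs: float,
              silence_ratio: float, dynamic_range_db: float) -> str:
    """Roll the four ffmpeg metrics into an A/B/C/F flag
    (illustrative thresholds)."""
    if mean_dbfs < -60 or silence_ratio > 0.95:
        return "F"  # effectively silent track
    if peak_dbfs > -1.0:
        return "C"  # peaks near 0 dBFS suggest clipping
    if mean_dbfs > -25 and dynamic_range_db > 30:
        return "A"  # hot, dynamic speech track
    if mean_dbfs > -35:
        return "B"  # usable but quiet or roomy
    return "C"
```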
# check every ISO in a HICAM session
python3 scripts/hicam-iso-quality.py \
--session public/clips/hicam/260316/hicam-session.json \
--json public/clips/hicam/260316/iso-quality.json
# or ad-hoc across specific WAVs
python3 scripts/hicam-iso-quality.py mic1.wav mic2.wav cam1-audio.wav
Sample output from session 260316:
name grade dur_s mean_dB peak_dB silence
cam1-audio B 7529.0 -30.2 -4.2 0.038
program-aud F 900.0 -91.0 -91.0 1.000
The existing HICAM notes confirmed what the grader caught automatically: program-audio is silent at -91 dB; cam1-audio (camera room mic) is the only usable local track. Catch this in seconds, not after a failed transcription run.
On well-recorded sessions, each speaker has their own lavalier or hand mic. That mic is the correct ISO for that speaker — not the program mix, not the room mic. The pipeline builds a speaker_id → iso_path map after grading:
- Drop any ISO graded F (silent / clipped).
- Align the surviving ISOs against the diarized speaker_id turns.
- Cache the routing to {session}.speakers.json, including iso_path per speaker.

Parallel to ISO routing, layout A's top region follows whoever is speaking right now. Two paths:
- Each speaker_id gets an x-range in the wide shot (manual label from a clear frame, or face-cluster by x-position). At each caption cue boundary, crop that speaker's region into the top 1080×1080. Hard-cut at the utterance boundary; no fade (keeps rendering deterministic).

Speaker identity drives three things at once: caption colour, caption font (optional), and top-crop source. The user verifies the mapping in one checkpoint, not three:
speaker_0 → iso: mic1.wav (grade A, mean -22 dB)
x_range: 120–420 (composite frame)
→ name? [jordaaan]
speaker_1 → iso: mic2.wav (grade A, mean -25 dB)
x_range: 420–720
→ name? [colin]
speaker_2 → iso: cam3-audio.wav (grade B, mean -31 dB)
x_range: 720–1020
→ name? [steven]
Once confirmed, colour/font/crop are locked. The cache in {session}.speakers.json skips this on re-renders.
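The lock itself is just a JSON round-trip. A sketch of the cache, assuming a per-speaker record shaped like the checkpoint above (name, iso_path, x_range); the registry dict mirrors the HICAM colour table, and the function names are hypothetical.

```python
import json
from pathlib import Path

# mirrors the known-speaker registry table above
REGISTRY = {"jordaaan": "#f2d21b", "colin": "#ff8c00", "steven": "#00e676"}

def lock_speakers(session_json: Path, confirmed: dict) -> dict:
    """Persist the confirmed speaker map; re-renders load it and skip
    the checkpoint. Unknown names fall back to foreign-film yellow."""
    for info in confirmed.values():
        info["colour"] = REGISTRY.get(info["name"], "#F2D21B")
    session_json.write_text(json.dumps(confirmed, indent=2))
    return confirmed

def load_speakers(session_json: Path):
    """Return the cached map, or None when the checkpoint still has to run."""
    return json.loads(session_json.read_text()) if session_json.exists() else None
```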
Multi-ISO sessions routinely include at least one dead track, like the -91 dB file. The grader flags them before transcription wastes CPU.
Drop multiple longform links in any turn. Each gets its own project folder under Clips/ and runs through the full kickoff flow independently. The runner chains them so downloads + transcriptions overlap where possible — next source's download starts while the current source is rendering.
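That overlap is classic pipelining: a bounded queue between a downloader thread and the renderer. A toy sketch of the principle only — the real runner's chaining logic is not shown in this guide, and the callables stand in for yt-dlp and the ff renderer.

```python
import queue
import threading

def run_sources(sources, download, render):
    """While source N renders, source N+1 downloads.
    download/render are per-source callables."""
    q = queue.Queue(maxsize=1)  # at most one fetched-but-unrendered source
    done = []

    def downloader():
        for src in sources:
            q.put(download(src))
        q.put(None)  # sentinel: no more sources

    t = threading.Thread(target=downloader)
    t.start()
    while (item := q.get()) is not None:
        done.append(render(item))
    t.join()
    return done
```

With `maxsize=1`, the downloader stays exactly one source ahead of the renderer and never buffers a whole backlog of multi-gigabyte files on disk at once.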
# typical multi-source turn
make clips for these:
https://youtu.be/abc123 # source 1
https://twitch.tv/videos/456789 # source 2
/Users/me/local-recording.mp4 # source 3
# result: three project folders in Clips/
# source-1-slug-portrait-ff/
# source-2-slug-portrait-ff/
# source-3-slug-portrait-ff/
# each with 5 clips; only one post-completion alternate-layout prompt per source
Pitfalls:

- Homebrew ffmpeg ships without libass, so -vf subtitles= fails. Use the PIL+moviepy renderer.
- Run watcher scripts with /bin/bash (Homebrew 5.x) or fall back to a case statement.
- gws --upload can't reach outside the current working directory — cd first.
- drive files create mints a new fileId and breaks every share link you sent. Use drive files update for iteration.
- Whisper small mishears names. If your content is jargon-dense, jump to medium for the first pass and skip the re-render cycle.

project/
├── source/ # long-form recordings (gitignored)
├── landscape-masters/ # extracted 1662×1080 cuts
│ └── _transcripts/{slug}.json # whisper output
├── public/clips/portrait/podcast-clips/
│ ├── {session}-captioned/{slug}-captioned.mp4 # landscape + v1 caption
│ └── {session}-portrait-ff/{slug}-ff.mp4 # final 1080×1920
├── .claude/skills/
│ ├── portrait-foreignfilm-clips/ # first render + upload
│ └── caption-quality-boost/ # re-transcribe + replace
└── DOCUMENTATION/PORTRAIT-CAPTIONS-GUIDE.html # this doc