End-to-end: take a long-form recording (Twitch VOD, podcast session, livestream) and ship five ready-to-post 9:16 clips into a Google shared drive with caption quality that survives iteration. Every file in Drive has a stable webViewLink that never changes, no matter how many times you re-transcribe or re-render.
The pipeline is two project skills working in sequence: portrait-foreignfilm-clips (first pass, end-to-end) and caption-quality-boost (re-transcribe + replace in place). Both live in .claude/skills/.
Short-form vertical is where distribution happens — TikTok, Reels, Shorts, LinkedIn video. Most creators solve this one of two ways, and both hurt: (a) sit in a timeline editor for an hour per clip, or (b) trust an auto-crop tool that guillotines faces and drops the dock on top of captions. Neither scales.
This pipeline exists because the second you treat portrait as the default — not a post-hoc conversion from a landscape master — three things fall out:
Iteration PATCHes the existing Drive file. When a sponsor, editor, or collaborator has a link, that link keeps working — and the quality keeps going up on the other end of it. The net effect: a 60-minute session becomes five posted-quality verticals in about half an hour, and you can re-transcribe with a better model two weeks later without anyone's link ever going stale.
The pipeline triggers on any longform link or file path — YouTube, Twitch VOD, local .mp4. Defaults are committed to project memory: don't ask about count, selection, orientation, or layout. Run the full flow; pause at one checkpoint only.
| default | value |
|---|---|
| clip count | 5 |
| selection | LLM-rank top 5 from whisper transcript (auto-confirmed) |
| orientation | portrait 9:16 (1080×1920) |
| layout | A · face-top (alternates offered after completion) |
| face detection | dynamic per source — probe midpoint frame, cache face.json |
| captions | foreign-film yellow italic serif; color-coded per speaker if multi-speaker |
| destination | Google Drive shared drive Clips → {session}-portrait-ff/ |
| checkpoint | one — alternate layouts after the A batch uploads |
Each session gets its own project folder under the shared drive's Clips/.
The only stages that need you are source and select. Everything from extract onward runs in the background, in parallel across all clips, and uploads itself to Drive as each one finishes. You kick it off and walk away.
Approximate wall-clock times for a typical run: one 60-minute source, five clips averaging 45 seconds each, on an M-series Mac. Stages 3–6 run concurrently across all clips with one renderer + watcher invocation.
| # | stage | mode | time | notes |
|---|---|---|---|---|
| 1 | Source acquire | you | 0–5 min | Instant if local; 2–5 min for yt-dlp on a 1-hour video. |
| 2 | Select moments | you | 2–20 min | 2 min with timestamps; 5 min via LLM rank; 15–20 min by hand. |
| ── fire & forget · stages 3–6 run in parallel ────────────────── | ||||
| 3 | Extract portrait × 5 | bg · parallel | ~45s | ffmpeg filter pass; 5 concurrent, I/O-bound. |
| 4 | Transcribe small × 5 | bg · parallel | ~2 min | CPU-bound; 5 cores saturate, roughly real-time per clip. |
| 4b | Transcribe medium × 5 | bg · parallel | ~4 min | Better proper nouns. Still CPU-bound — model shared across workers. |
| 5 | Caption composite × 5 | bg · parallel | ~2 min | moviepy + PIL per clip; libx264 encode is the floor. |
| 6 | Drive upload × 5 | bg · streaming | ~1 min | Watcher fires per .done sentinel; 5 uploads overlap. |
| — | First pass — hands-off wall | — | ~6 min | With small whisper. Stages 3–6 end-to-end, parallel. |
| — | First pass — total door-to-door | — | ~15 min | Including selection and source fetch. |
| — | Iteration pass (medium + re-render + replace) | bg · parallel | ~7 min | No re-selection. Drive files update in place — links unchanged. |
This guide is written in the order of the pipeline. If you already have landscape-captioned clips on disk, skip to #portrait. If you're starting from a raw recording, begin at #source.
Two common sources: a Twitch VOD / YouTube upload, or a locally-recorded podcast session. Both end in the same place: one large .mp4 on disk.
# youtube / twitch
yt-dlp -f "bv*[height<=1080]+ba/b[height<=1080]" \
-o "source/%(title)s.%(ext)s" \
"https://youtu.be/<id>"
# or just drop a local recording into source/
cp ~/Movies/Session-2026-04-16.mp4 source/
Keep downloads on a purely local path such as ~/local/source/ or /tmp if iCloud sync interferes with partial files. Moviepy itself is fine with iCloud.
Three ways, pick one.
Transcribe the whole long-form once, then ask Claude to rank the top N moments against a rubric: hook strength, standalone coherence, quotability, visible screen activity. Output: a JSON list of {start, end, slug, title}.
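A sketch of what that list can look like, reusing values from the manual CSV later in this guide (the exact field formats are up to the ranker):

```json
[
  {
    "start": "00:03:14",
    "end": "00:04:19",
    "slug": "experiment-until-you-beat-the-record",
    "title": "Experiment until you beat the record"
  },
  {
    "start": "00:07:02",
    "end": "00:07:50",
    "slug": "i-wasnt-crazy-this-works",
    "title": "I wasn't crazy — this works"
  }
]
```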
Existing skill at .claude/skills/best-clips/. Scores long-form windows on visible coding activity + transcript energy. Good for stream recordings where the right side of the screen is actively changing.
For a curated show, write the timestamps by hand into a CSV.
# clip-list.csv
slug,start,end,title
experiment-until-you-beat-the-record,00:03:14,00:04:19,"Experiment until you beat the record"
i-wasnt-crazy-this-works,00:07:02,00:07:50,"I wasn't crazy — this works"
mcp-reliability-is-a-gamble,00:12:45,00:13:17,"MCP reliability is a gamble"
put-your-expertise-in-the-skills,00:21:30,00:22:19,"Put your expertise in the skills"
same-team-more-clients,00:33:12,00:34:00,"Same team, more clients"
Portrait is the default — every clip goes straight to 1080×1920, never through a landscape intermediate. The layout transform (face crop + screen crop + cover bar) happens during extraction, not after.
mkdir -p clips-portrait
while IFS=, read -r slug start end title; do
[[ "$slug" == "slug" ]] && continue
ffmpeg -y -ss "$start" -to "$end" -i "source/session.mp4" \
-filter_complex "
[0:v]crop=260:260:1370:790,scale=1080:1080[face];
[0:v]crop=1340:1080:0:0,scale=1080:-2,crop=1080:840[screen];
[face][screen]vstack
" \
-c:v libx264 -preset medium -crf 20 \
-c:a aac -b:a 128k \
"clips-portrait/${slug}.mp4"
done < clip-list.csv
Output: five 1080×1920 clips, face on top, screen below. Face PiP is already baked into the source recording at roughly (1370, 790) with size 260×260; tune those numbers per session.
openai-whisper on CPU. small is ~real-time; medium is ~3× slower but meaningfully better on proper nouns and jargon — start with small on the first pass, upgrade in iteration.
python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py \
landscape-masters/
# writes landscape-masters/_transcripts/{slug}.json — whisper raw, word_timestamps=true
Optional but conventional in this project. Many upstream clips already ship with a yellow all-caps caption burned into the landscape master. If yours do, call that folder *-captioned/ and continue. If not, you can skip straight to the portrait stage — the foreignfilm caption layer in stage 7 stands on its own.
The cover bar in stage 6 is sized to fully obscure any existing burned-in captions at y ≈ 860–890 of the 1080-tall source. If your source has no prior captions, you can drop or shrink the bar.
The layout transform. Two independent crops from the same 1662×1080 landscape master, vertically stacked into a 1080×1920 canvas.
| layer | size | y range | source |
|---|---|---|---|
| face | 1080×1080 | 0–1080 | crop 260² @ (1370, 790), upscaled 4× |
| screen | 1080×840 | 1080–1920 | crop 1340×1080, scaled to fit width, center-cropped |
| cover bar | 1080×220 | 1700–1920 | solid black, opacity 1.0 — hides any v1 captions |
| caption PNG | ≤1080×180 | ~1780 | pre-rasterized per cue (stage 7) |
Homebrew's ffmpeg 8 ships without libass, the subtitles filter, or even drawtext. So SRT/ASS burn-in is off the table. Instead: pre-rasterise each cue to a transparent PNG with PIL, then composite with moviepy.
# group whisper words into screen-cue chunks
groups = []; cur = []
for w in words:
    cur.append(w)
    # start a new cue at 5 words, 2.5 s, or 34 chars, whichever comes first
    if len(cur) >= 5 or cur[-1].end - cur[0].start >= 2.5 \
            or len(" ".join(x.word for x in cur)) >= 34:
        groups.append(cur); cur = []
if cur:
    groups.append(cur)  # flush the trailing partial cue
Each group becomes a PNG drawn with Georgia Bold Italic 60pt, fill #F2D21B, 4px black stroke, centered, wrapped at ~22 chars. Positioned y = 1920 - img_h - 30.
| field | value | why |
|---|---|---|
| font | Georgia Bold Italic | foreign-film default; reads warm and serious |
| size | 60pt | legible at thumb scroll scale |
| colour | #F2D21B | warm yellow, pops on dark screens |
| stroke | 4px black | survives bright backgrounds without a box |
| chunk | ≤5 words / ≤2.5s / ≤34 chars | TikTok-pace, readable before it moves |
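The ~22-character wrap itself is plain stdlib textwrap; a minimal sketch of that step (font rasterization aside):

```python
import textwrap

def wrap_caption(text: str, width: int = 22) -> list[str]:
    """Wrap one cue's text into lines of at most `width` characters,
    breaking on whitespace, for the PIL caption renderer."""
    return textwrap.wrap(text, width=width)
```

At 60pt Georgia on a 1080px-wide canvas, ~22 characters per line is roughly what fits with the 4px stroke.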
gws drive files update --upload replaces the media content of an existing file. The fileId and every webViewLink you've already sent keep working. Never create during iteration.
# list the destination folder, build a name → id inventory
gws drive files list \
--params '{"q":"\"<folder-id>\" in parents and trashed=false",
"driveId":"0AI4JyAzsoqRJUk9PVA",
"corpora":"drive",
"includeItemsFromAllDrives":true,
"supportsAllDrives":true,
"pageSize":200,
"fields":"files(id,name)"}'
# for each local mp4 — replace if match, create if new
cd "$OUT_DIR"
gws drive files update \
--params "{\"fileId\":\"$ID\",\"supportsAllDrives\":true}" \
--upload "$f" --upload-content-type video/mp4
gws cannot resolve --upload paths outside the current working directory. Always cd into the output folder and pass a basename.
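The replace-or-create decision reduces to a lookup against the name → id inventory built by the list call; a sketch (function and variable names are illustrative, not gws output fields):

```python
def plan_uploads(inventory: dict[str, str],
                 local_files: list[str]) -> tuple[list[tuple[str, str]], list[str]]:
    """Split local basenames into (replace, create) against a Drive
    name -> fileId inventory. Replacements keep their fileId, so every
    webViewLink already shared stays valid."""
    replace = [(name, inventory[name]) for name in local_files if name in inventory]
    create = [name for name in local_files if name not in inventory]
    return replace, create
```

Everything in `replace` goes through `drive files update`; everything in `create` is a genuinely new clip.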
Two-skill orchestration. Skill 1 does the first render and uploads. Skill 2 upgrades and replaces.
SESSION=measure-summit-2026
SRC="public/clips/portrait/podcast-clips/${SESSION}-captioned"
OUT="public/clips/portrait/podcast-clips/${SESSION}-portrait-ff"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py "$SRC"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py "$SRC" "$OUT"
bash .claude/skills/portrait-foreignfilm-clips/scripts/upload_clips.sh "$OUT" "${SESSION}-portrait-ff"
python3 .claude/skills/caption-quality-boost/scripts/retranscribe.py "$SRC" --model medium
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py "$SRC" "$OUT"
bash .claude/skills/caption-quality-boost/scripts/drive_replace.sh "$OUT" "${SESSION}-portrait-ff"
Render in one shell; the watcher in another uploads each clip the instant its .done sentinel appears.
# shell 1 — renderer touches {out}.done after each clip
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py "$SRC" "$OUT"
# shell 2 — watcher polls, replaces in place, writes .replaced
/bin/bash .claude/skills/caption-quality-boost/scripts/drive_replace_watch.sh "$OUT" "${SESSION}-portrait-ff"
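The sentinel handshake can be sketched in a few lines of Python. This assumes sentinels are named {slug}-ff.done next to the rendered {slug}-ff.mp4, and that the watcher writes {slug}-ff.replaced when done; verify against your renderer's actual naming:

```python
from pathlib import Path

def pending_uploads(out_dir: str) -> list[Path]:
    """Clips ready to upload: a .done sentinel exists, no .replaced yet."""
    ready = []
    for done in sorted(Path(out_dir).glob("*.done")):
        clip = done.with_suffix(".mp4")           # slug-ff.done -> slug-ff.mp4
        replaced = done.with_suffix(".replaced")  # written by the watcher
        if clip.exists() and not replaced.exists():
            ready.append(clip)
    return ready
```

The watcher polls this set, uploads each clip, and writes the .replaced marker so a crash mid-batch never double-uploads.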
| lever | from → to | lift | cost |
|---|---|---|---|
| whisper model | small → medium | big win on proper nouns, jargon | ~3× CPU time |
| LLM polish pass | raw → Claude-cleaned | punctuation, split run-ons, fix names | ~1 API call / clip |
| chunk length | 5w / 2.5s → 3w / 1.5s | tighter pacing, more beats | config only |
| caption size | 60pt → 72pt | easier at thumb scale | config only |
| face crop tightness | 260² → 220² | closer face, more emotion | per-session tune |
| moment selection | manual → best-clips skill | finds higher-energy hooks you'd skim past | ~1 min / hr of source |
A · face-top is the committed default. Face cropped from the source PiP, upscaled into the top 1080px. Screen cropped to the left ~1320 columns, scaled to 1080×840 in the bottom half. Cover bar 1080×220 at y=1700 hides platform UI, dock, and any baked-in captions.
After the A batch uploads, the runner asks once whether to render alternates. Pick any subset:
| layout | bottom framing | use when |
|---|---|---|
| A | screen cropped + center-fit | default. Face reads clearly; screen compressed. |
| B | screen center-cropped to 9:16 directly | pure screen-content clips where face isn't useful. |
| C | full screen fit + blurred duplicate background | source has no clean face PiP, or screen content is cropped too aggressively in A. |
| D | face-top unchanged · bottom = full-screen floating card on blurred duplicate | artistic variant — full screen visible, no content lost, more depth. |
Alternates land in {session}-portrait-ff/alternates/layout-{b,c,d}/. Share links on the A originals stay stable.
Multi-speaker clips get color-coded captions per speaker, optionally with a matching font. Built on pyannote diarization over the source audio.
Flow for a multi-speaker source:
1. Diarize the source audio into {speaker_0, speaker_1, …} timestamped turns.
2. Export a reference sample per speaker to {source}.speakers/{id}.{wav,jpg}.
3. Checkpoint: confirm each speaker_id → person_name mapping.
4. An overrides.json entry re-renders a clip without redoing diarization.

Known speaker registry (HICAM):
| name | colour | hex |
|---|---|---|
| jordaaan | yellow | #f2d21b |
| colin | orange | #ff8c00 |
| steven | green | #00e676 |
Verified mappings cache to {session}.speakers.json so re-renders skip the checkpoint. Single-speaker sources bypass diarization entirely and render with the default foreign-film yellow.
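A hypothetical shape for that cache, using the registry above plus the per-speaker iso_path the HICAM flow adds (field names illustrative; check the skill's scripts for the real schema):

```json
{
  "speaker_0": {"name": "jordaaan", "color": "#f2d21b", "iso_path": "mic1.wav"},
  "speaker_1": {"name": "colin", "color": "#ff8c00", "iso_path": "mic2.wav"},
  "speaker_2": {"name": "steven", "color": "#00e676", "iso_path": "cam3-audio.wav"}
}
```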
HICAM-style podcasts ship as multi-ISO recordings — one dedicated microphone track per speaker plus multiple camera angles. Each ISO has its own signal characteristics (mic placement, gain staging, room noise). Quality is not constant across ISOs, and it differs speaker-to-speaker because each person has a different mic on their voice.
Before anything renders, the pipeline scores each ISO and picks the best source per speaker. Wrong ISO choice = muddy captions, mis-transcriptions, wrong color attribution. Audio quality is gating.
scripts/hicam-iso-quality.py ingests all staged ISO WAVs for a session and grades each one on ffmpeg-measured signal:
| metric | from | what it tells you |
|---|---|---|
| mean_dbfs | volumedetect | Overall loudness. Below -60 dB = effectively silent. |
| peak_dbfs | volumedetect | Headroom. Values near 0 dB suggest clipping. |
| silence_ratio | silencedetect | Fraction of duration below -40 dB. >0.95 = a dead track. |
| noise_floor_dbfs | astats | Residual room/hiss. Combined with peak gives dynamic range. |
| dynamic_range_db | astats | Wider = more speech dynamics; narrower = compressed/room-only. |
| grade | rubric | A / B / C / F — rolls the above into a usable flag. |
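The rubric's exact thresholds aren't specified here; a hypothetical version consistent with the metrics table and the sample grades below (cutoffs are guesses, not the script's actual values):

```python
def grade_iso(mean_dbfs: float, peak_dbfs: float,
              silence_ratio: float, dynamic_range_db: float) -> str:
    """Roll the four ffmpeg-derived metrics into an A/B/C/F flag."""
    if mean_dbfs < -60 or silence_ratio > 0.95:
        return "F"   # effectively silent / dead track
    if peak_dbfs > -1.0:
        return "C"   # probable clipping
    if mean_dbfs > -28 and dynamic_range_db > 20:
        return "A"   # hot, dynamic speech track
    return "B"       # usable but not ideal
```

Under these guesses, program-audio (-91 dB mean, silence 1.0) grades F and cam1-audio (-30.2 dB mean) grades B, matching the session 260316 output.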
# check every ISO in a HICAM session
python3 scripts/hicam-iso-quality.py \
--session public/clips/hicam/260316/hicam-session.json \
--json public/clips/hicam/260316/iso-quality.json
# or ad-hoc across specific WAVs
python3 scripts/hicam-iso-quality.py mic1.wav mic2.wav cam1-audio.wav
Sample output from session 260316:
name grade dur_s mean_dB peak_dB silence
cam1-audio B 7529.0 -30.2 -4.2 0.038
program-aud F 900.0 -91.0 -91.0 1.000
The existing HICAM notes confirmed what the grader caught automatically: program-audio is silent at -91 dB; cam1-audio (camera room mic) is the only usable local track. Catch this in seconds, not after a failed transcription run.
On well-recorded sessions, each speaker has their own lavalier or hand mic. That mic is the correct ISO for that speaker — not the program mix, not the room mic. The pipeline builds a speaker_id → iso_path map after grading:
1. Drop every ISO graded F (silent / clipped).
2. Match the remaining ISOs to diarized speaker_id turns.
3. Cache the map in {session}.speakers.json, including iso_path per speaker.

Parallel to ISO routing, layout A's top region follows whoever is speaking right now. Two paths: cut to that speaker's dedicated camera ISO when the session has one, or crop the active speaker_id's x-range in the wide shot (manual label from a clear frame, or face-cluster by x-position). At each caption cue boundary, crop that speaker's region into the top 1080×1080. Hard-cut at the utterance boundary; no fade (keeps rendering deterministic).

Speaker identity drives three things at once: caption colour, caption font (optional), and top-crop source. The user verifies the mapping in one checkpoint, not three:
speaker_0 → iso: mic1.wav (grade A, mean -22 dB)
x_range: 120–420 (composite frame)
→ name? [jordaaan]
speaker_1 → iso: mic2.wav (grade A, mean -25 dB)
x_range: 420–720
→ name? [colin]
speaker_2 → iso: cam3-audio.wav (grade B, mean -31 dB)
x_range: 720–1020
→ name? [steven]
Once confirmed, colour/font/crop are locked. The cache in {session}.speakers.json skips this on re-renders.
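The grading-to-routing step reduces to an argmax per speaker. A sketch, assuming an overlap score (seconds of that speaker's diarized speech audible on each ISO) has already been computed; the scoring itself is an assumption, not the skill's actual method:

```python
def route_isos(grades: dict[str, str],
               overlap: dict[str, dict[str, float]]) -> dict[str, str]:
    """Pick the best non-F ISO per speaker by speech-overlap seconds."""
    usable = {iso for iso, g in grades.items() if g != "F"}
    routing = {}
    for speaker, scores in overlap.items():
        candidates = {iso: s for iso, s in scores.items() if iso in usable}
        routing[speaker] = max(candidates, key=candidates.get)
    return routing
```

Note the F filter runs first: a silent program mix can still "overlap" every speaker on paper, so it must never win the argmax.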
Don't trust the program mix by default; it can be a -91 dB file. The grader flags these before transcription wastes CPU.
Drop multiple longform links in any turn. Each gets its own project folder under Clips/ and runs through the full kickoff flow independently. The runner chains them so downloads + transcriptions overlap where possible — next source's download starts while the current source is rendering.
# typical multi-source turn
make clips for these:
https://youtu.be/abc123 # source 1
https://twitch.tv/videos/456789 # source 2
/Users/me/local-recording.mp4 # source 3
# result: three project folders in Clips/
# source-1-slug-portrait-ff/
# source-2-slug-portrait-ff/
# source-3-slug-portrait-ff/
# each with 5 clips; only one post-completion alternate-layout prompt per source
A live three.js dashboard (pipeline-viz/) renders every stage of the pipeline as a lit platform in a dark scene. A clip's first frame rides along curved tubes between platforms in real time — the same JPEG you'd see if you opened the MP4 at frame 45. Runs 24/7 on the claw worker; reach it from any machine on Tailscale.
Hit http://100.82.244.127:5173/ (Tailscale) or http://172.16.11.133:5173/ (LAN). The dashboard connects to a WebSocket event stream on :8787/events and streams thumbnails from :8787/thumb?clip=&t=.
┌──────┐ ┌──────┐ ┌───────┐ ┌──────────┐ ┌────────┐ ┌─────────┐ ┌──────┐
│source│→ │select│→ │extract│→ │transcribe│→ │captions│→ │composite│→ │drive │
└──────┘ └──┬───┘ └───────┘ └──────────┘ └────────┘ └────┬────┘ └───┬──┘
│ │ │
└─────────────────────┐ ┌───────────────────────┘ │
▼ ▼ │
┌──────────┐ │
│apertureDb│ ◄──────── webViewLink ─────────┘
└──────────┘
Each platform shows the tool driving that stage (yt-dlp, ffmpeg, whisper, PIL, moviepy, gws) and pulses its ring color when an event fires. ApertureDB is offset as a sidecar — it's where every clip is cataloged, where dedupe queries run, and where Drive share links are backfilled after upload.
| file landing | event | edge the billboard travels |
|---|---|---|
| workspace/source/*.mp4 | source_acquired | source → select |
| workspace/moments/*.json | moments_selected | select → extract |
| workspace/_transcripts/*.json | transcribe_complete | extract → transcribe |
| workspace/clips/{session}-portrait-ff/*-ff.mp4 | composite_complete | captions → composite |
| workspace/clips/{session}-portrait-ff/*-ff.replaced | drive_upload_complete | composite → drive |
| (catalog POST after composite) | catalog_write | composite → apertureDb |
| (backfill POST after drive) | catalog_write | drive → apertureDb |
| service | port | role |
|---|---|---|
| vite | 5173 | serves the three.js dashboard |
| ws.ts (tsx) | 8787 | chokidar watchers + /thumb endpoint + WebSocket broadcast |
| catalog.py (docker) | 48788 (tunneled) | HTTP bridge to ApertureDB |
| aperturedb-community (docker) | 45555 (tunneled) | vector + metadata catalog |
| tunnel.sh (launchd) | — | SSH forward colima-VM ports → claw host |
| youtube_poster.py (launchd, 60s tick) | — | drains the ScheduledPost queue |
Colima's lima guestagent does not auto-forward docker-published ports on this box, so services that need the DB talk to it through the tunnel.sh SSH forward (claw:48788 → VM:8788 → catalog container). Co-located workers on claw read/write through this path; remote clients only see the three.js dashboard.
Session { session_id, source_url, long_form_path, duration_s, ingested_at }
Clip { clip_id, clip_code ("CAPI-03a"), batch_code, session_id,
slug, start_s, end_s, grade, moment_score, caption_text,
hook_text, script_text, cta_text, principle,
drive_url, drive_file_id, layout, status, path }
Speaker { speaker_id, display_name, color_hex }
Batch { batch_id, name, principle_tag, cta_url, source_sheet_tab }
Metric { clip_id, platform, captured_at,
views, likes, comments, shares, saves, watch_time_s, ctr }
ScheduledPost { post_id, clip_code, platform, mode, scheduled_at,
status, caption, hashtags, thumbnail_path,
result_url, error, created_at, posted_at }
DescriptorSet clip_marengo_v1 # 1024-d · cosine · HNSW · for dedupe + semantic search
Naming scheme. Every clip gets a clip_code like CAPI-03a — two-to-four-letter batch tag, zero-padded sequence, single variant letter. Easy to say aloud, easy to grep, natural ordering. Inspired by the user's long-running Content Matrix (Hook# / Script# / CTA Letter) without inheriting its full column set.
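The code format is regular enough to parse mechanically; a sketch whose pattern is inferred from "CAPI-03a" rather than taken from the catalog code:

```python
import re

CLIP_CODE = re.compile(r"^([A-Z]{2,4})-(\d{2,})([a-z])$")

def parse_clip_code(code: str) -> tuple[str, int, str]:
    """Split a code like 'CAPI-03a' into (batch_tag, sequence, variant)."""
    m = CLIP_CODE.match(code)
    if m is None:
        raise ValueError(f"not a clip_code: {code!r}")
    batch, seq, variant = m.groups()
    return batch, int(seq), variant
```

The zero-padded sequence means plain string sort and numeric sort agree up to 99 clips per batch.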
After a clip lands in Drive with a stable webViewLink, posting happens in one of two modes. YouTube is fully automated via the Data API; Meta is deliberately human-in-the-loop to dodge the account-suspension risk that comes with pure-API Instagram posting.
| platform | mode | mechanism | why |
|---|---|---|---|
| YouTube | auto | Data API v3 videos.insert (resumable) | Quota-friendly, reliable, no inauthenticity flags |
| Meta (Reels/Feed) | assist | dashboard prepares caption + hashtags + thumbnail, copies to clipboard, deep-links instagram://camera | Pure-API posting flags business accounts; IG Reels API is missing stickers/music/polls anyway |
| TikTok | historical | (none — Content Posting API is approval-gated) | Past URLs stay in the catalog for insights; no new direct posting |
# enqueue a YouTube auto-post
curl -X POST http://localhost:48788/schedule \
-H 'content-type: application/json' -d '{
"clip_code": "CAPI-03a",
"platform": "youtube",
"scheduled_at": "2026-04-20T14:00:00Z",
"caption": "The hidden 30% of sales you are missing",
"hashtags": "#shorts #marketing"
}'
# returns { post_id: "CAPI-03a-youtube-1", mode: "auto", status: 0 }
# list everything in the queue
curl 'http://localhost:48788/schedule/list?status=queued'
# Meta assist — returns copy-ready payload, no post happens yet
curl -X POST http://localhost:48788/assist/meta \
-H 'content-type: application/json' -d '{ "clip_code": "CAPI-03a" }'
youtube_poster.py runs under launchd with a 60-second tick:
1. GET /schedule/list?status=queued, then filter to platform=="youtube" AND scheduled_at <= now.
2. Mark status=posting, then resumable upload via MediaFileUpload(chunksize=-1, resumable=True).
3. Mark status=posted with result_url, or status=failed with error.
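The due-post filter in step 1 is a one-liner over the queue JSON; a sketch using the field names from the enqueue example above (treat them as assumptions about the bridge's payload):

```python
def due_youtube_posts(queue: list[dict], now: str) -> list[dict]:
    """Queued YouTube posts whose scheduled_at has passed.
    ISO-8601 UTC timestamps compare correctly as plain strings."""
    return [p for p in queue
            if p["platform"] == "youtube"
            and p["status"] == "queued"
            and p["scheduled_at"] <= now]
```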
Quota. videos.insert costs ~1600 units; default daily quota is 10 000, so ≈6 uploads/day before you need to request an increase. Stagger scheduled_at across the day to avoid burn.
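The arithmetic behind that ceiling:

```python
DAILY_QUOTA = 10_000   # default YouTube Data API daily units
INSERT_COST = 1_600    # approximate cost of one videos.insert
max_uploads_per_day = DAILY_QUOTA // INSERT_COST  # integer floor
```

Six uploads burn 9 600 units; a seventh would need 11 200, hence the stagger-and-request-an-increase advice.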
Copy pipeline-viz/server/credentials.example.env to credentials.env (gitignored) and fill in:
# YouTube Data API v3
YT_CLIENT_ID=
YT_CLIENT_SECRET=
YT_REFRESH_TOKEN=
YT_CHANNEL_ID=
# Meta Graph API (used by assist-mode for caption prep + future automation)
META_APP_ID=
META_APP_SECRET=
META_LONG_LIVED_USER_TOKEN=
META_INSTAGRAM_BUSINESS_ACCOUNT_ID=
META_FACEBOOK_PAGE_ID=
# Fetch cadence for the insights worker (views/likes/ctr)
INSIGHTS_POLL_MINUTES=60
INSIGHTS_FIRST_24H_POLL_MINUTES=15
YouTube OAuth mint-a-refresh-token flow: enable Data API v3 + YouTube Analytics API in Google Cloud Console, create an OAuth client (Desktop type), run gws auth youtube once, grant scopes, paste the refresh token into credentials.env. The analytics scope requires the channel owner's consent.
- No subtitle burn-in: Homebrew ffmpeg lacks libass and drawtext, so there is no -vf subtitles=. Use the PIL+moviepy renderer.
- Watcher scripts assume /bin/bash (Homebrew 5.x) or a case statement.
- gws cannot resolve --upload paths outside the current working directory; cd first.
- drive files create mints a new fileId and breaks every share link you sent. Use drive files update for iteration.
- Whisper small mishears names. If your content is jargon-dense, jump to medium for the first pass and skip the re-render cycle.

project/
├── source/ # long-form recordings (gitignored)
├── landscape-masters/ # extracted 1662×1080 cuts
│ └── _transcripts/{slug}.json # whisper output
├── public/clips/portrait/podcast-clips/
│ ├── {session}-captioned/{slug}-captioned.mp4 # landscape + v1 caption
│ └── {session}-portrait-ff/{slug}-ff.mp4 # final 1080×1920
├── .claude/skills/
│ ├── portrait-foreignfilm-clips/ # first render + upload
│ └── caption-quality-boost/ # re-transcribe + replace
└── DOCUMENTATION/PORTRAIT-CAPTIONS-GUIDE.html # this doc