PortraitCaptions

long-form source → ranked moments → captioned verticals → Drive, link-stable.

Overview

End-to-end: take a long-form recording (Twitch VOD, podcast session, livestream) and ship five ready-to-post 9:16 clips into a Google shared drive with caption quality that survives iteration. Every file in Drive has a stable webViewLink that never changes, no matter how many times you re-transcribe or re-render.

The pipeline is two project skills working in sequence: portrait-foreignfilm-clips (first pass, end-to-end) and caption-quality-boost (re-transcribe + replace in place). Both live in .claude/skills/.

Why this matters

Short-form vertical is where distribution happens — TikTok, Reels, Shorts, LinkedIn video. Most creators solve this one of two ways, and both hurt: (a) sit in a timeline editor for an hour per clip, or (b) trust an auto-crop tool that guillotines faces and drops the dock on top of captions. Neither scales.

This pipeline exists because the second you treat portrait as the default — not a post-hoc conversion from a landscape master — three things fall out: framing is decided once at extraction, captions are designed for the 9:16 frame instead of squeezed into it, and every quality pass is a cheap in-place re-render because the Drive links never move.

The net effect: a 60-minute session becomes five posted-quality verticals in about half an hour, and you can re-transcribe with a better model two weeks later without anyone's link ever going stale.

Kickoff — drop a link, walk away

The pipeline triggers on any longform link or file path — YouTube, Twitch VOD, local .mp4. Defaults are committed to project memory: don't ask about count, selection, orientation, or layout. Run the full flow; pause at one checkpoint only.

default          value
clip count       5
selection        LLM-rank top 5 from whisper transcript (auto-confirmed)
orientation      portrait 9:16 (1080×1920)
layout           A · face-top (alternates offered after completion)
face detection   dynamic per source — probe midpoint frame, cache face.json
captions         foreign-film yellow italic serif; color-coded per speaker if multi-speaker
destination      Google Drive shared drive Clips · {session}-portrait-ff/
checkpoint       one — alternate layouts after the A batch uploads
queueable. Drop multiple links in order and they'll process sequentially without re-asking anything. Each source gets its own project folder under Clips/.

Timings — fire and forget

The only stages that need you are source and select. Everything from extract onward runs in the background, in parallel across all clips, and uploads itself to Drive as each one finishes. You kick it off and walk away.

hands-on ≈ 5 min · hands-off ≈ 10 min. Your attention budget for a five-clip batch is roughly one coffee. The machine spends the rest of the time rendering while you're somewhere else.

Approximate wall-clock times for a typical run: one 60-minute source, five clips averaging 45 seconds each, on an M-series Mac. Stages 3–6 run concurrently across all clips with one renderer + watcher invocation.

#    stage                        mode            time       notes
1    Source acquire               you             0–5 min    Instant if local; 2–5 min for yt-dlp on a 1-hour video.
2    Select moments               you             2–20 min   2 min with timestamps; 5 min via LLM rank; 15–20 min by hand.
── fire & forget · stages 3–6 run in parallel ──────────────────
3    Extract portrait × 5         bg · parallel   ~45s       ffmpeg filter pass; 5 concurrent, I/O-bound.
4    Transcribe small × 5         bg · parallel   ~2 min     CPU-bound; 5 cores saturate, roughly real-time per clip.
4b   Transcribe medium × 5        bg · parallel   ~4 min     Better proper nouns. Still CPU-bound — model shared across workers.
5    Caption composite × 5        bg · parallel   ~2 min     moviepy + PIL per clip; libx264 encode is the floor.
6    Drive upload × 5             bg · streaming  ~1 min     Watcher fires per .done sentinel; 5 uploads overlap.
     First pass — hands-off wall                  ~6 min     With small whisper. Stages 3–6 end-to-end, parallel.
     First pass — total door-to-door              ~15 min    Including selection and source fetch.
     Iteration pass (medium + re-render + replace)  bg · parallel  ~7 min  No re-selection. Drive files update in place — links unchanged.
first pass vs iteration. Selection happens once per source. Every quality pass after that is transcribe → render → replace in parallel — cheap enough to do weekly as whisper improves or you find a caption phrasing you like better.

Pipeline

┌────────────────────────┐
│ 1 · SOURCE             │  long-form recording acquired          yt-dlp / local
└───────────┬────────────┘                                        screen rec
            ▼
┌────────────────────────┐
│ 2 · SELECT             │  rank moments, produce                 best-clips
└───────────┬────────────┘  {start, end, title}                   TwelveLabs
            ▼                                                     manual list
┌────────────────────────┐
│ 3 · EXTRACT (PORTRAIT) │  cut direct to 1080×1920 —             ffmpeg -ss -to
└───────────┬────────────┘  face top · screen below               single filter pass
            ▼
┌────────────────────────┐
│ 4 · TRANSCRIBE         │  whisper word-timestamps per clip      openai-whisper
└───────────┬────────────┘                                        small → medium
            ▼
┌────────────────────────┐
│ 5 · CAPTIONS (FF)      │  foreign-film yellow italic serif,     Georgia Bold Italic
└───────────┬────────────┘  pre-rasterized PNGs                   #F2D21B, 4px stroke
            ▼
┌────────────────────────┐
│ 6 · COMPOSITE          │  layer captions + cover bar            moviepy + PIL
└───────────┬────────────┘  onto portrait clip                    1080×1920 final
            ▼
┌────────────────────────┐
│ 7 · DRIVE              │  upload or update-in-place to          gws drive files
└────────────────────────┘  Clips shared drive                    {session}-portrait-ff

This guide is written in the order of the pipeline. If you already have landscape-captioned clips on disk, skip to #portrait. If you're starting from a raw recording, begin at #source.

default orientation: portrait. Every stage in this pipeline outputs 1080×1920. Landscape clips are only produced when explicitly requested.
stage 1

Source — acquire the long-form

Two common sources: a Twitch VOD / YouTube upload, or a locally-recorded podcast session. Both end in the same place: one large .mp4 on disk.

# youtube / twitch
yt-dlp -f "bv*[height<=1080]+ba/b[height<=1080]" \
       -o "source/%(title)s.%(ext)s" \
       "https://youtu.be/<id>"

# or just drop a local recording into source/
cp ~/Movies/Session-2026-04-16.mp4 source/

iCloud trap. Don't stage sources on iCloud Drive — the FUSE layer times out Chromium and Remotion. Stage to ~/local/source/ or /tmp. Moviepy itself is fine with iCloud.
stage 2

Select — pick the moments

Three ways, pick one.

a · transcript + LLM (fastest)

Transcribe the whole long-form once, then ask Claude to rank the top N moments against a rubric: hook strength, standalone coherence, quotability, visible screen activity. Output: a JSON list of {start, end, slug, title}.
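
The ranked output is easy to sanity-check before extraction. A minimal sketch, assuming the JSON shape above (timestamps as HH:MM:SS strings) and an illustrative 15–120 s short-form window — the window is an assumption, not a pipeline rule:

```python
import json

# hypothetical ranked-moments file — same {slug, start, end, title}
# shape as the manual CSV in option c
ranked = json.loads("""[
  {"slug": "experiment-until-you-beat-the-record",
   "start": "00:03:14", "end": "00:04:19",
   "title": "Experiment until you beat the record"}
]""")

def to_seconds(ts):
    """HH:MM:SS → seconds, for sanity-checking clip bounds."""
    h, m, s = (int(x) for x in ts.split(":"))
    return h * 3600 + m * 60 + s

for clip in ranked:
    dur = to_seconds(clip["end"]) - to_seconds(clip["start"])
    # 15–120 s is an assumed short-form window
    assert 15 <= dur <= 120, f"{clip['slug']}: {dur}s out of range"
```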

b · best-clips skill

Existing skill at .claude/skills/best-clips/. Scores long-form windows on visible coding activity + transcript energy. Good for stream recordings where the right-hand screen region is active.

c · manual list

For a curated show, write the timestamps by hand into a CSV.

# clip-list.csv
slug,start,end,title
experiment-until-you-beat-the-record,00:03:14,00:04:19,"Experiment until you beat the record"
i-wasnt-crazy-this-works,00:07:02,00:07:50,"I wasn't crazy — this works"
mcp-reliability-is-a-gamble,00:12:45,00:13:17,"MCP reliability is a gamble"
put-your-expertise-in-the-skills,00:21:30,00:22:19,"Put your expertise in the skills"
same-team-more-clients,00:33:12,00:34:00,"Same team, more clients"
stage 3

Extract — cut portrait clips direct from source

Portrait is the default — every clip goes straight to 1080×1920, never through a landscape intermediate. The layout transform (face crop + screen crop + cover bar) happens during extraction, not after.

mkdir -p clips-portrait
while IFS=, read -r slug start end title; do
  [[ "$slug" == "slug" ]] && continue
  ffmpeg -y -ss "$start" -to "$end" -i "source/session.mp4" \
    -filter_complex "
      [0:v]crop=260:260:1370:790,scale=1080:1080[face];
      [0:v]crop=1340:1080:0:0,scale=1080:-2,crop=1080:840[screen];
      [face][screen]vstack
    " \
    -c:v libx264 -preset medium -crf 20 \
    -c:a aac -b:a 128k \
    "clips-portrait/${slug}.mp4"
done < clip-list.csv

Output: five 1080×1920 clips, face on top, screen below. Face PiP is already baked into the source recording at roughly (1370, 790) with size 260×260; tune those numbers per session.

portrait is the default. Don't produce landscape intermediates and transform later. Landscape is the exception — only cut 16:9 when the user explicitly calls for it (YouTube long-form, desktop player, etc.). For everything else: 9:16 from the first frame.
stage 4

Transcribe — word-level timing

openai-whisper on CPU. small is ~real-time; medium is ~3× slower but meaningfully better on proper nouns and jargon — start with small on the first pass, upgrade in iteration.

python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py \
        landscape-masters/

# writes landscape-masters/_transcripts/{slug}.json — whisper raw, word_timestamps=true
stage 5

Burn-in v1 — the landscape master caption

Optional but conventional in this project. Many upstream clips already ship with a yellow all-caps caption burned into the landscape master. If yours do, call that folder *-captioned/ and continue. If not, you can skip straight to the portrait stage — the foreign-film caption layer in stage 7 stands on its own.

The cover bar in stage 6 is sized to fully obscure any existing burned-in captions at y ≈ 860–890 of the 1080-tall source. If your source has no prior captions, you can drop or shrink the bar.
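
A quick check that the 860–890 caption band really lands under the bar, using the stage-6 numbers (screen crop scaled 1340→1080, ~15 px trimmed by the center-crop, stacked at y = 1080) — a back-of-envelope sketch, not part of the skill:

```python
SCALE = 1080 / 1340          # screen crop scaled to portrait width
CROP_TOP = (870 - 840) // 2  # 15 px trimmed off the top by center-crop
STACK_Y = 1080               # screen region sits below the 1080² face

def portrait_y(source_y):
    """Map a y in the 1080-tall source to the final 1080×1920 frame."""
    return source_y * SCALE - CROP_TOP + STACK_Y

lo, hi = portrait_y(860), portrait_y(890)   # ≈ 1758 … 1782
assert 1700 <= lo and hi <= 1920            # inside the 1700–1920 bar
```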

stage 6

Portrait — face top, screen below

The layout transform. Two independent crops from the same 1662×1080 landscape master, vertically stacked into a 1080×1920 canvas.

   ◄──────────── 1662 ────────────►                    ◄── 1080 ──►
 ┌─────────────────────────────────┐                   ┌──────────┐ ─┐
 │                                 │   crop 260²       │          │  │
 │         (screen region)         │   @ (1370, 790)   │   face   │  │ 1080
 │                                 │   upscale 4× ──►  │  1080²   │  │
 │                     ┌─PiP──┐    │                   ├──────────┤ ─┘
 │                     │ 260² │    │   crop 1340×1080  │  screen  │ ─┐
 └─────────────────────┴──────┴────┘   scale to width  │ 1080×840 │  │ 840
                                       center-crop ──► ├──────────┤ ─┘
                                                       │██████████│ ─┐
                                                       │cover bar │  │ 220 (1700–1920)
                                                       └──────────┘ ─┘
layer         size        y range     source
face          1080×1080   0–1080      crop 260² @ (1370, 790), upscaled 4×
screen        1080×840    1080–1920   crop 1340×1080, scaled to fit width, center-cropped
cover bar     1080×220    1700–1920   solid black, opacity 1.0 — hides any v1 captions
caption PNG   ≤1080×180   ~1780       pre-rasterized per cue (stage 7)
face box is session-specific. Before batching, sample a midpoint frame and eyeball the PiP bounds. Podcast rigs drift the PiP position by ±40px between sessions.
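
A small helper can build the midpoint-probe command from the source duration. The helper and the `face-probe.png` output name are illustrative, not part of the skill:

```python
import shlex

def midpoint_probe_cmd(src, duration_s):
    """Build (not run) an ffmpeg command that grabs the midpoint frame
    so the PiP bounds can be eyeballed before batching."""
    mid = duration_s / 2
    return (f"ffmpeg -y -ss {mid:.1f} -i {shlex.quote(src)} "
            f"-frames:v 1 face-probe.png")

cmd = midpoint_probe_cmd("source/session.mp4", 3600)
# → ffmpeg -y -ss 1800.0 -i source/session.mp4 -frames:v 1 face-probe.png
```
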
stage 7

Captions — the foreign-film look

Homebrew's ffmpeg 8 ships without libass, the subtitles filter, or even drawtext. So SRT/ASS burn-in is off the table. Instead: pre-rasterise each cue to a transparent PNG with PIL, then composite with moviepy.

# group whisper words into screen-cue chunks
groups = []; cur = []
for w in words:
    cur.append(w)
    if len(cur) >= 5 or cur[-1].end - cur[0].start >= 2.5 \
       or len(" ".join(x.word for x in cur)) >= 34:
        groups.append(cur); cur = []
if cur:                       # flush the trailing partial group
    groups.append(cur)

Each group becomes a PNG drawn with Georgia Bold Italic 60pt, fill #F2D21B, 4px black stroke, centered, wrapped at ~22 chars. Positioned y = 1920 - img_h - 30.
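
The wrap-and-position step can be sketched without PIL. LINE_H is an assumed per-line pixel height at 60pt including stroke — the real image height comes from the rendered PNG:

```python
import textwrap

WRAP_CHARS = 22   # ~22-char wrap from the spec above
MARGIN = 30       # bottom margin: y = 1920 - img_h - 30
LINE_H = 72       # assumed px per wrapped line; real value is the PNG's

def layout_cue(text):
    """Wrap a cue and compute its paste-y on the 1080×1920 canvas."""
    lines = textwrap.wrap(text, WRAP_CHARS)
    img_h = LINE_H * len(lines)
    return lines, 1920 - img_h - MARGIN

lines, y = layout_cue("Experiment until you beat the record")
# → lines = ["Experiment until you", "beat the record"], y = 1746
```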

field    value                          why
font     Georgia Bold Italic            foreign-film default; reads warm and serious
size     60pt                           legible at thumb-scroll scale
colour   #F2D21B                        warm yellow, pops on dark screens
stroke   4px black                      survives bright backgrounds without a box
chunk    ≤5 words / ≤2.5s / ≤34 chars   TikTok-pace, readable before it moves
stage 8

Drive — upload & link-stable replace

gws drive files update --upload replaces the media content of an existing file. The fileId and every webViewLink you've already sent keep working. Never create during iteration.

# list the destination folder, build a name → id inventory
gws drive files list \
  --params '{"q":"\"<folder-id>\" in parents and trashed=false",
             "driveId":"0AI4JyAzsoqRJUk9PVA",
             "corpora":"drive",
             "includeItemsFromAllDrives":true,
             "supportsAllDrives":true,
             "pageSize":200,
             "fields":"files(id,name)"}'

# for each local mp4 — replace if match, create if new
cd "$OUT_DIR"
gws drive files update \
  --params "{\"fileId\":\"$ID\",\"supportsAllDrives\":true}" \
  --upload "$f" --upload-content-type video/mp4
gws path scoping. The CLI refuses --upload paths outside the current working directory. Always cd into the output folder and pass a basename.
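
The replace-or-create decision reduces to a lookup on that name → id inventory. A sketch of the logic (function name and shapes are illustrative; the real flow is the shell above):

```python
def plan_uploads(inventory, local_files):
    """inventory: name → fileId built from `gws drive files list`."""
    plan = []
    for name in local_files:
        if name in inventory:
            # existing fileId → update in place; webViewLink survives
            plan.append(("update", inventory[name], name))
        else:
            # new clip → create once; its link is stable afterwards
            plan.append(("create", None, name))
    return plan

plan = plan_uploads({"intro-ff.mp4": "1AbC"}, ["intro-ff.mp4", "new-ff.mp4"])
# → [("update", "1AbC", "intro-ff.mp4"), ("create", None, "new-ff.mp4")]
```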

Run it

Two-skill orchestration. Skill 1 does the first render and uploads. Skill 2 upgrades and replaces.

First pass

SESSION=measure-summit-2026
SRC="public/clips/portrait/podcast-clips/${SESSION}-captioned"
OUT="public/clips/portrait/podcast-clips/${SESSION}-portrait-ff"

python3 .claude/skills/portrait-foreignfilm-clips/scripts/transcribe.py          "$SRC"
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"
bash    .claude/skills/portrait-foreignfilm-clips/scripts/upload_clips.sh        "$OUT" "${SESSION}-portrait-ff"

Iterate — better model, same links

python3 .claude/skills/caption-quality-boost/scripts/retranscribe.py             "$SRC" --model medium
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"
bash    .claude/skills/caption-quality-boost/scripts/drive_replace.sh            "$OUT" "${SESSION}-portrait-ff"

Live replace (parallel)

Render in one shell; the watcher in another uploads each clip the instant its .done sentinel appears.

# shell 1 — renderer touches {out}.done after each clip
python3 .claude/skills/portrait-foreignfilm-clips/scripts/render_portrait_ff.py  "$SRC" "$OUT"

# shell 2 — watcher polls, replaces in place, writes .replaced
/bin/bash .claude/skills/caption-quality-boost/scripts/drive_replace_watch.sh    "$OUT" "${SESSION}-portrait-ff"

Quality levers

lever                 from → to                   lift                                        cost
whisper model         small → medium              big win on proper nouns, jargon             ~3× CPU time
LLM polish pass       raw → Claude-cleaned        punctuation, split run-ons, fix names       ~1 API call / clip
chunk length          5w / 2.5s → 3w / 1.5s       tighter pacing, more beats                  config only
caption size          60pt → 72pt                 easier at thumb scale                       config only
face crop tightness   260² → 220²                 closer face, more emotion                   per-session tune
moment selection      manual → best-clips skill   finds higher-energy hooks you'd skim past   ~1 min / hr of source

Layouts — A default, alternates on demand

A · face-top is the committed default. Face cropped from the source PiP, upscaled into the top 1080px. Screen cropped to the left ~1340 columns, scaled to 1080×840 in the bottom half. Cover bar 1080×220 at y=1700 hides platform UI, dock, and any baked-in captions.

After the A batch uploads, the runner asks once whether to render alternates. Pick any subset:

layout   bottom framing                                                      use when
A        screen cropped + center-fit                                         default. Face reads clearly; screen compressed.
B        screen center-cropped to 9:16 directly                              pure screen-content clips where face isn't useful.
C        full screen fit + blurred duplicate background                      source has no clean face PiP, or screen content is cropped too aggressively in A.
D        face-top unchanged · bottom = full-screen card on blurred duplicate artistic variant — full screen visible, no content lost, more depth.

Alternates land in {session}-portrait-ff/alternates/layout-{b,c,d}/. Share links on the A originals stay stable.

Speaker-aware captions

Multi-speaker clips get color-coded captions per speaker, optionally with a matching font. Built on pyannote diarization over the source audio.

accuracy is mandatory. LLM-based speaker-identity bootstrap is unreliable — it guesses identity from text content and gets it wrong often enough that blind color mapping is worse than no color at all. Every multi-speaker source goes through a user-verification checkpoint before burn-in.

Flow for a multi-speaker source:

  1. Diarize the full source audio — output {speaker_0, speaker_1, …} timestamped turns.
  2. Speaker manifest — per detected speaker, extract a 3–5s clearest-audio sample + a video frame from that moment. Written to {source}.speakers/{id}.{wav,jpg}.
  3. Verification checkpoint — present each speaker's sample to the user. User maps speaker_id → person_name.
  4. Name → colour/font map — project-level registry. Known speakers inherit committed colours.
  5. Render — per caption cue, the dominant speaker drives colour and font. Cues that span a speaker switch split at the boundary.
  6. Per-clip override — if a rendered clip misattributes, an overrides.json entry re-renders that clip without redoing diarization.
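
Step 5's boundary split is a pure interval intersection. A minimal sketch, assuming sorted, non-overlapping diarization turns:

```python
def split_cue(cue_start, cue_end, turns):
    """turns: [(start, end, speaker_id), …] sorted, non-overlapping.
    Returns the cue cut into per-speaker pieces."""
    pieces = []
    for t_start, t_end, spk in turns:
        lo, hi = max(cue_start, t_start), min(cue_end, t_end)
        if hi > lo:                       # cue overlaps this turn
            pieces.append((lo, hi, spk))
    return pieces

# a 10.0–13.0 s cue spanning a speaker switch at 11.5 s:
split_cue(10.0, 13.0, [(0.0, 11.5, "speaker_0"), (11.5, 60.0, "speaker_1")])
# → [(10.0, 11.5, "speaker_0"), (11.5, 13.0, "speaker_1")]
```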

Known speaker registry (HICAM):

name       colour   hex
jordaaan   yellow   #f2d21b
colin      orange   #ff8c00
steven     green    #00e676

Verified mappings cache to {session}.speakers.json so re-renders skip the checkpoint. Single-speaker sources bypass diarization entirely and render with the default foreign-film yellow.

Multi-speaker podcast processing

HICAM-style podcasts ship as multi-ISO recordings — one dedicated microphone track per speaker plus multiple camera angles. Each ISO has its own signal characteristics (mic placement, gain staging, room noise). Quality is not constant across ISOs, and it differs speaker-to-speaker because each person has a different mic on their voice.

Before anything renders, the pipeline scores each ISO and picks the best source per speaker. Wrong ISO choice = muddy captions, mis-transcriptions, wrong color attribution. Audio quality is gating.

Audio — ISO quality step

scripts/hicam-iso-quality.py ingests all staged ISO WAVs for a session and grades each one on ffmpeg-measured signal:

metric             from            what it tells you
mean_dbfs          volumedetect    Overall loudness. Below -60 dB = effectively silent.
peak_dbfs          volumedetect    Headroom. Values near 0 dB suggest clipping.
silence_ratio      silencedetect   Fraction of duration below -40 dB. >0.95 = a dead track.
noise_floor_dbfs   astats          Residual room/hiss. Combined with peak gives dynamic range.
dynamic_range_db   astats          Wider = more speech dynamics; narrower = compressed/room-only.
grade              rubric          A / B / C / F — rolls the above into a usable flag.

# check every ISO in a HICAM session
python3 scripts/hicam-iso-quality.py \
  --session public/clips/hicam/260316/hicam-session.json \
  --json public/clips/hicam/260316/iso-quality.json

# or ad-hoc across specific WAVs
python3 scripts/hicam-iso-quality.py mic1.wav mic2.wav cam1-audio.wav

Sample output from session 260316:

       name        grade        dur_s      mean_dB      peak_dB      silence
cam1-audio             B       7529.0        -30.2         -4.2        0.038
program-aud            F        900.0        -91.0        -91.0        1.000

The existing HICAM notes confirmed what the grader caught automatically: program-audio is silent at -91 dB; cam1-audio (camera room mic) is the only usable local track. Catch this in seconds, not after a failed transcription run.
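
The roll-up into a grade might look like the sketch below. The thresholds beyond the documented -60 dB and 0.95 cutoffs are assumptions, as is cam1-audio's dynamic range; the real rubric lives in scripts/hicam-iso-quality.py:

```python
def grade_iso(mean_dbfs, peak_dbfs, silence_ratio, dynamic_range_db):
    """Illustrative A/B/C/F rubric over the ffmpeg-measured metrics."""
    if mean_dbfs < -60 or silence_ratio > 0.95:
        return "F"        # effectively silent / dead track
    if peak_dbfs > -1:
        return "C"        # near 0 dB — likely clipping
    if dynamic_range_db < 20:
        return "C"        # compressed / room-only
    if mean_dbfs > -28 and silence_ratio < 0.2:
        return "A"
    return "B"

grade_iso(-30.2, -4.2, 0.038, 26.0)   # cam1-audio metrics → "B"
grade_iso(-91.0, -91.0, 1.000, 0.0)   # program-audio      → "F"
```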

Per-speaker ISO routing

On well-recorded sessions, each speaker has their own lavalier or hand mic. That mic is the correct ISO for that speaker — not the program mix, not the room mic. The pipeline builds a speaker_id → iso_path map after grading:

  1. Grade every ISO (step above).
  2. Drop tracks graded F (silent / clipped).
  3. Diarize the program mix to get speaker_id turns.
  4. For each diarized speaker, correlate which ISO has the highest speech energy during that speaker's turns — that ISO is their mic.
  5. Cache {session}.speakers.json including iso_path per speaker.
  6. At render time, extract each speaker's cue audio from their own ISO (never the program mix).
why this matters for captions. Transcribing speaker A from speaker A's own lavalier gives word-level timing and accuracy the program mix can't match — because the program mix has bleed, background, and compression that whisper hates. Per-speaker ISO transcription fixes misattribution and improves word-boundary timing directly.
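
Step 4's correlation can be sketched with per-second energy profiles. This is an assumed simplification of the script's method, with illustrative ISO names:

```python
def route_isos(turns, iso_energy):
    """turns: [(start_s, end_s, speaker_id)]; iso_energy: iso → list of
    per-second RMS values. Picks the ISO with the most energy during
    each speaker's turns."""
    totals = {}
    for start, end, spk in turns:
        per_iso = totals.setdefault(spk, {})
        for iso, energy in iso_energy.items():
            per_iso[iso] = per_iso.get(iso, 0.0) + sum(energy[int(start):int(end)])
    return {spk: max(isos, key=isos.get) for spk, isos in totals.items()}

turns = [(0, 4, "speaker_0"), (4, 8, "speaker_1")]
iso_energy = {
    "mic1.wav": [0.9, 0.8, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1],  # hot 0–4 s
    "mic2.wav": [0.1, 0.1, 0.1, 0.1, 0.8, 0.9, 0.9, 0.8],  # hot 4–8 s
}
route_isos(turns, iso_energy)
# → {"speaker_0": "mic1.wav", "speaker_1": "mic2.wav"}
```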

Visual — active-speaker crop

Parallel to ISO routing, layout A's top region follows whoever is speaking right now, cropping the face from the frame region tied to the active speaker — the x_range bounds confirmed at the verification checkpoint.

Combined verification checkpoint

Speaker identity drives three things at once: caption colour, caption font (optional), and top-crop source. The user verifies the mapping in one checkpoint, not three:

speaker_0  →  iso: mic1.wav  (grade A, mean -22 dB)
              x_range: 120–420  (composite frame)
              → name?   [jordaaan]

speaker_1  →  iso: mic2.wav  (grade A, mean -25 dB)
              x_range: 420–720
              → name?   [colin]

speaker_2  →  iso: cam3-audio.wav  (grade B, mean -31 dB)
              x_range: 720–1020
              → name?   [steven]

Once confirmed, colour/font/crop are locked. The cache in {session}.speakers.json skips this on re-renders.

Known pitfalls on multi-ISO sources

Queueing multiple sources

Drop multiple longform links in any turn. Each gets its own project folder under Clips/ and runs through the full kickoff flow independently. The runner chains them so downloads + transcriptions overlap where possible — next source's download starts while the current source is rendering.

# typical multi-source turn
make clips for these:
  https://youtu.be/abc123           # source 1
  https://twitch.tv/videos/456789  # source 2
  /Users/me/local-recording.mp4     # source 3

# result: three project folders in Clips/
#   source-1-slug-portrait-ff/
#   source-2-slug-portrait-ff/
#   source-3-slug-portrait-ff/
# each with 5 clips; only one post-completion alternate-layout prompt per source

Pitfalls

Appendix — file layout

project/
├── source/                                         # long-form recordings (gitignored)
├── landscape-masters/                              # extracted 1662×1080 cuts
│   └── _transcripts/{slug}.json                    # whisper output
├── public/clips/portrait/podcast-clips/
│   ├── {session}-captioned/{slug}-captioned.mp4    # landscape + v1 caption
│   └── {session}-portrait-ff/{slug}-ff.mp4         # final 1080×1920
├── .claude/skills/
│   ├── portrait-foreignfilm-clips/                 # first render + upload
│   └── caption-quality-boost/                      # re-transcribe + replace
└── DOCUMENTATION/PORTRAIT-CAPTIONS-GUIDE.html      # this doc