
Force-align narration and transcript text into subtitle or SMIL timing maps

Use aeneas when an agent already has audio and text, but still needs timing. The workflow aligns spoken narration against fragments of plain text or XML and emits sync maps that can be turned into subtitles, EPUB 3 media overlays, JSON timing data, or other downstream caption assets.


Media & Transcription · Multi-Framework · Security Reviewed

⭐ 2.8k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill force-align-narration-and-transcript-text-into-subtitle-or-smil-timing-maps


Works best when you want a reusable capability, not another fragile one-off prompt.

View source Documentation

At a glance

Tools required

Python, pip, FFmpeg, and eSpeak

Install & setup

Install FFmpeg and eSpeak, then run: pip install numpy && pip install aeneas

Author

Alberto Pettarin

Publisher

Open Source Project

Last updated

Apr 12, 2026

Quick brief

This ASE entry is built around aeneas (the readbeyond/aeneas project), a tool for forced alignment between audio and text. The agent's job is well defined: take a narration file and its transcript or marked-up source text, align the text fragments to the spoken audio, and emit a timing map in a format that downstream systems can consume, such as SRT, WebVTT, TTML, JSON, SMIL, or CSV, for use as subtitles or EPUB media overlays. That is a bounded operational workflow for agents that prepare accessible media or synchronized reading experiences.
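The workflow above can be sketched as a small command builder. This is a minimal sketch, not the skill's actual implementation: it assumes aeneas's command-line tool is invoked as the module aeneas.tools.execute_task and that the task is configured with a pipe-separated string of task_language, is_text_type, and os_task_file_format keys. The helper names build_task_config and build_aeneas_command are hypothetical.

```python
import sys


def build_task_config(language="eng", text_type="plain", output_format="json"):
    """Return an aeneas-style task configuration string.

    Assumption: the configuration is a pipe-separated list of key=value pairs.
    """
    pairs = [
        ("task_language", language),           # language of the narration
        ("is_text_type", text_type),           # e.g. one fragment per line
        ("os_task_file_format", output_format),  # sync map format to emit
    ]
    return "|".join(f"{key}={value}" for key, value in pairs)


def build_aeneas_command(audio_path, text_path, output_path, config,
                         python=sys.executable):
    """Return the argv list for one alignment run (nothing is executed here)."""
    return [
        python, "-m", "aeneas.tools.execute_task",
        audio_path, text_path, config, output_path,
    ]
```

With aeneas installed, the returned list could be passed to subprocess.run(cmd, check=True). Building the command separately from running it keeps the step easy to log and review before any audio is processed.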

How it works

What this skill actually does

Invoke this skill when the user already has both the words and the audio, but the missing piece is time alignment. It fits audiobook chapter prep, language-learning material, accessibility overlays, narrated documentation, podcast chapterization experiments, and subtitle generation pipelines where the transcript exists but timestamps do not. This is not the right tool when the user needs speech recognition from scratch, video editing, or a human subtitle editing workstation. It is specifically for forced alignment, where the text is known and the agent must compute the timing map automatically.

The scope boundary is what keeps the entry skill-shaped. This is not a listing for a generic audio platform, CMS, or desktop editor. The workflow starts with source audio plus source text and ends with synchronized timing artifacts that another tool can package, render, or review. Integration points include Python scripts, FFmpeg-based preprocessing, eSpeak-backed alignment setups, EPUB production pipelines, subtitle QA loops, and localization workflows that need machine-generated first-pass timings before manual polish. Because the upstream project publishes a real repository, docs, license, release tags, and installation guidance, it clears the evidence gate comfortably.
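As an illustration of the handoff to a downstream tool, the sketch below turns a JSON sync map into SRT cues and drops zero-length fragments along the way, the kind of check a subtitle QA loop might run. It assumes the JSON sync map is an object with a "fragments" list whose items carry "begin" and "end" (seconds, as strings) and "lines" (a list of text lines); verify that layout against real aeneas output before relying on it. aeneas can also write SRT directly, so this only matters when JSON is kept as the canonical artifact. The function names are hypothetical.

```python
def seconds_to_srt(value):
    """Format seconds (number or numeric string) as an SRT timestamp."""
    total_ms = int(round(float(value) * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sync_map_to_srt(sync_map):
    """Convert a parsed sync map dict to SRT text.

    Fragments whose end is not after their begin are skipped, because
    they cannot be displayed as subtitle cues.
    """
    cues = []
    for fragment in sync_map.get("fragments", []):
        begin = float(fragment["begin"])
        end = float(fragment["end"])
        if end <= begin:
            continue
        text = "\n".join(fragment.get("lines", []))
        index = len(cues) + 1  # SRT cue numbers start at 1
        cues.append(
            f"{index}\n{seconds_to_srt(begin)} --> {seconds_to_srt(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

A caller would typically load the file with json.load and write the returned string to a .srt file. Skipped fragments are worth logging, since they usually point at text that the aligner could not place in the audio.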

Best fit

When to reach for it

Best when the job fits Media & Transcription.
Works naturally with Multi-Framework setups.
Requires Python, pip, FFmpeg, and eSpeak.
Installation is straightforward: install FFmpeg and eSpeak, then run: pip install numpy && pip install aeneas (numpy is installed first, as a separate step, before aeneas).
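Before running an alignment, an agent may want to confirm that the external tools listed above are actually on the PATH. The following is a minimal sketch under stated assumptions: the executable names ffmpeg, ffprobe, and espeak are guesses at what a typical install provides (ffprobe usually ships with FFmpeg, and some systems name the synthesizer differently), and missing_tools is a hypothetical helper, not part of aeneas.

```python
import shutil

# Assumed executable names; adjust for the target system.
REQUIRED_TOOLS = ("ffmpeg", "ffprobe", "espeak")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of required executables that cannot be found.

    The lookup function is injectable so the check can be tested
    without touching the real PATH.
    """
    return [name for name in tools if which(name) is None]
```

An empty list means the prerequisites appear to be present. The upstream project also provides its own diagnostics module (python -m aeneas.diagnostics), which is likely the more thorough check once the package is installed.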

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
2.8k GitHub stars on the linked upstream source.
Last updated Apr 12, 2026.
