Skill Detail

Turn captured WARC pages into clean text and language-tagged records with warc2text

Use warc2text when an agent already has WARC captures and needs readable text, language identification, and exportable records for review, search, or corpus building instead of re-crawling pages.

Data Extraction & TransformationMulti-Framework

Data Extraction & Transformation Multi-Framework Security Reviewed

⭐ 23 GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text

Copy

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Tools required

warc2text build or binary, WARC input files, local output storage

Install & setup

Install the documented build dependencies, build or install warc2text, then run `warc2text -o <output_folder> [options] <warc_file>…` to emit text, HTML, and metadata outputs from archived captures.

Author

Bitextor contributors

Publisher

Open Source Project

Last updated

Apr 19, 2026

Quick brief

Best for: research and archive workflows where web captures already exist as WARC files and the next step is turning them into usable text outputs.

How it works

What this skill actually does

warc2text extracts plain text, HTML, metadata, and language information from WARC records. It can emit structured outputs, split multilingual content, and write language-organized results to disk. That makes it a sharp post-capture transformation skill rather than a generic scraping listing.

When to invoke it

Invoke this skill after collection, when an agent needs to convert archived WARC payloads into text-rich records for indexing, filtering, QA, or downstream NLP work.

Scope boundary

This is not a general archive platform listing. The skill boundary is a single conversion workflow: ingest WARC files, extract text and metadata, optionally classify language, and emit structured output files.

Install notes

Build or install warc2text with the documented system dependencies.
Choose an output folder and desired output types.
Run warc2text -o output_dir ... input.warc.gz against the capture set.

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Multi-Framework setups.
Requires warc2text build or binary, WARC input files, local output storage.
Installation is straightforward: Install the documented build dependencies, build or install warc2text, then run `warc2text -o [options] …` to…

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
23 GitHub stars on the linked upstream source.
Last updated Apr 19, 2026.

View source ↗