Skill Detail

Turn captured WARC pages into clean text and language-tagged records with warc2text

Use warc2text when an agent already has WARC captures and needs readable text, language identification, and exportable records for review, search, or corpus building instead of re-crawling pages.

Data Extraction & Transformation Multi-Framework Security Reviewed
⭐ 23 GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill turn-captured-warc-pages-into-clean-text-and-language-tagged-records-with-warc2text
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
warc2text build or binary, WARC input files, local output storage
Install & setup
Install the documented build dependencies, build or install warc2text, then run `warc2text -o <output_folder> [options] <warc_file>…` to emit text, HTML, and metadata outputs from archived captures.
Author
Bitextor contributors
Publisher
Open Source Project
Last updated
Apr 19, 2026
Quick brief

Best for: research and archive workflows where web captures already exist as WARC files and the next step is turning them into usable text outputs.

How it works

What this skill actually does

warc2text extracts plain text, HTML, metadata, and language information from WARC records. It can emit structured outputs, split multilingual content, and write language-organized results to disk. That makes it a focused post-capture transformation step rather than a general-purpose scraper.
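
As a rough illustration of consuming the language-organized results, the sketch below reads one output folder back into Python records. It assumes a layout that is not spelled out in this listing: one subfolder per language code containing `text.gz` (one base64-encoded document per line) and `url.gz` (one URL per line, in the same order). The helper name `iter_records` is hypothetical, and the layout should be checked against the README of the installed warc2text version before relying on it.

```python
import base64
import gzip
from pathlib import Path
from typing import Dict, Iterator


def iter_records(output_dir: str) -> Iterator[Dict[str, str]]:
    """Yield {"lang", "url", "text"} records from a warc2text output folder.

    ASSUMPTION: <output_dir>/<lang>/text.gz holds one base64-encoded document
    per line and <output_dir>/<lang>/url.gz holds the matching URLs, line by line.
    """
    for lang_dir in sorted(Path(output_dir).iterdir()):
        if not lang_dir.is_dir():
            continue
        text_path = lang_dir / "text.gz"
        url_path = lang_dir / "url.gz"
        if not (text_path.exists() and url_path.exists()):
            continue  # skip folders that do not match the assumed layout
        with gzip.open(text_path, "rt", encoding="utf-8") as texts, \
                gzip.open(url_path, "rt", encoding="utf-8") as urls:
            # Lines are paired by position: line N of text.gz with line N of url.gz.
            for encoded, url in zip(texts, urls):
                raw = base64.b64decode(encoded.strip())
                yield {
                    "lang": lang_dir.name,
                    "url": url.strip(),
                    "text": raw.decode("utf-8", errors="replace"),
                }
```

Yielding records lazily keeps memory use flat on large capture sets, and the resulting dictionaries can be written out as JSON Lines or fed to an indexer.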

When to invoke it

Invoke this skill after collection, when an agent needs to convert archived WARC payloads into text-rich records for indexing, filtering, QA, or downstream NLP work.

Scope boundary

This is not a general archive platform listing. The skill boundary is a single conversion workflow: ingest WARC files, extract text and metadata, optionally classify language, and emit structured output files.

Install notes

  1. Build or install warc2text with the documented system dependencies.
  2. Choose an output folder and desired output types.
  3. Run `warc2text -o output_dir ... input.warc.gz` against the capture set.
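
The steps above can be sketched as a small Python wrapper for agents that drive the tool programmatically. This is a minimal sketch, not the project's own tooling: the function names `build_warc2text_command` and `run_warc2text` are hypothetical, only the `-o` flag from the usage line in this listing is used, and any further options are passed through unchanged by the caller.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Iterable, List, Sequence


def build_warc2text_command(
    output_dir: str,
    warc_files: Iterable[str],
    extra_options: Sequence[str] = (),
) -> List[str]:
    """Assemble `warc2text -o <output_folder> [options] <warc_file>...` as an argv list."""
    files = [str(f) for f in warc_files]
    if not files:
        raise ValueError("at least one WARC file is required")
    # extra_options is forwarded verbatim; this sketch does not validate flags.
    return ["warc2text", "-o", str(output_dir), *extra_options, *files]


def run_warc2text(
    output_dir: str,
    warc_files: Iterable[str],
    extra_options: Sequence[str] = (),
) -> List[str]:
    """Run warc2text if it is on PATH and return the command that was executed."""
    if shutil.which("warc2text") is None:
        raise FileNotFoundError("warc2text not found on PATH; build or install it first")
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_warc2text_command(output_dir, warc_files, extra_options)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on a non-zero exit
    return cmd
```

For example, `run_warc2text("output_dir", ["input.warc.gz"])` would execute `warc2text -o output_dir input.warc.gz`. Keeping command construction separate from execution lets the argv list be logged or reviewed before anything touches the capture set.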