Common Crawl Index Query Agent

Queries the Common Crawl Index API for large-scale web archive research and data extraction. Uses the CDX Server API, WARC record parsing with warcio, and the Common Crawl S3 bucket for bulk data access.

Research & Scraping OpenClaw Security Reviewed

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill common-crawl-index-query-agent

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Author

Common Crawl Foundation

Last updated

Mar 24, 2026

Quick brief

The Common Crawl Index Query Agent enables large-scale web research by querying the Common Crawl CDX Server API for historical web page snapshots across petabytes of archived web data. It constructs efficient index queries using URL prefix matching, domain filtering, and MIME type constraints to locate specific WARC records.
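As a rough illustration of the query construction described above, the following minimal sketch builds a CDX query URL with prefix matching and a MIME type filter. The helper name `build_cdx_query`, the example crawl ID, and the example domain are assumptions made for illustration and are not taken from the skill's source; the `matchType`, `filter`, `output`, and `limit` parameters follow the pywb-style CDX server API that the public Common Crawl index exposes.

```python
from urllib.parse import urlencode

# Public Common Crawl index endpoint. The crawl ID used below is only an
# example value; substitute whichever crawl you need.
CDX_BASE = "https://index.commoncrawl.org"


def build_cdx_query(crawl_id, url_pattern, mime=None, match_type=None, limit=None):
    """Return a CDX query URL for a single crawl index (hypothetical helper)."""
    params = [("url", url_pattern), ("output", "json")]
    if match_type:
        # e.g. "prefix" for URL prefix matching, "domain" for a whole domain
        params.append(("matchType", match_type))
    if mime:
        # CDX filters take the form field:expression (regex-matched by default)
        params.append(("filter", f"mime:{mime}"))
    if limit is not None:
        params.append(("limit", str(limit)))
    return f"{CDX_BASE}/{crawl_id}-index?{urlencode(params)}"


query = build_cdx_query(
    "CC-MAIN-2024-10",
    "example.com/blog/",
    mime="text/html",
    match_type="prefix",
    limit=50,
)
print(query)
```

Each line of the JSON response is one capture, including the filename, offset, and length needed to locate its WARC record.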

How it works

What this skill actually does

The agent retrieves individual WARC records from the Common Crawl S3 bucket (s3://commoncrawl) using byte-range requests built from the filename, offset, and length values in CDX API responses. It parses each record with the warcio library, extracting HTTP headers and response bodies from warcio record objects (ArcWarcRecord) for content analysis.
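The retrieval step can be sketched as follows. This is an illustrative sketch, not the skill's own implementation: it assumes the public HTTPS mirror of the bucket at data.commoncrawl.org rather than direct S3 access, and the helper names `range_header` and `fetch_record` are hypothetical. warcio is a third-party package and is imported only when a record is actually fetched.

```python
import io
import urllib.request

# Assumption: the HTTPS front end for the commoncrawl bucket.
DATA_BASE = "https://data.commoncrawl.org/"


def range_header(offset, length):
    """Build an HTTP Range value covering exactly one WARC record."""
    start = int(offset)  # CDX JSON returns offset and length as strings
    end = start + int(length) - 1  # Range end positions are inclusive
    return f"bytes={start}-{end}"


def fetch_record(cdx_row):
    """Fetch and parse the WARC record described by one CDX result row."""
    # Third-party dependency: pip install warcio
    from warcio.archiveiterator import ArchiveIterator

    request = urllib.request.Request(
        DATA_BASE + cdx_row["filename"],
        headers={"Range": range_header(cdx_row["offset"], cdx_row["length"])},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        raw = response.read()

    # Each record is its own gzip member, so the byte slice parses standalone.
    for record in ArchiveIterator(io.BytesIO(raw)):
        if record.rec_type == "response":
            return {
                "url": record.rec_headers.get_header("WARC-Target-URI"),
                "status": record.http_headers.get_statuscode(),
                "body": record.content_stream().read(),
            }
    return None
```

Calling `fetch_record` with a row from a CDX JSON response returns the target URL, HTTP status, and decoded body of that capture, or `None` if the slice contains no response record.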

Advanced features include parallel CDX pagination for high-volume queries across multiple crawl indices, content deduplication using the content digest values returned by the CDX API (a hash of each captured payload, not a simhash), and temporal analysis by querying across multiple crawl archives, each named CC-MAIN-YYYY-WW after the year and week number of the crawl. The agent also integrates with the columnar index format for SQL-like queries via Amazon Athena, and implements robots.txt compliance checking using the urllib.robotparser module against archived robots.txt snapshots.
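Two of these features, digest-based deduplication and robots.txt checking, can be sketched with the standard library alone. The function names and sample data below are hypothetical and shown only to illustrate the idea; the sketch assumes CDX results arrive as newline-delimited JSON with a `digest` field, and that the archived robots.txt body has already been retrieved as text.

```python
import json
from urllib.robotparser import RobotFileParser


def dedupe_by_digest(cdx_lines):
    """Keep the first capture seen for each distinct content digest."""
    seen = set()
    unique = []
    for line in cdx_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        digest = row.get("digest")
        if digest is None:
            unique.append(row)  # no digest to compare, so keep the row
        elif digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique


def allowed_by_archived_robots(robots_txt, user_agent, url):
    """Evaluate a URL against the text of an archived robots.txt snapshot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up sample rows: two captures share a digest, one differs.
sample_lines = [
    '{"url": "https://example.com/a", "timestamp": "20240101000000", "digest": "AAA"}',
    '{"url": "https://example.com/a", "timestamp": "20240201000000", "digest": "AAA"}',
    '{"url": "https://example.com/b", "timestamp": "20240201000000", "digest": "BBB"}',
]
unique_rows = dedupe_by_digest(sample_lines)

# Made-up robots.txt snapshot.
sample_robots = "User-agent: *\nDisallow: /private/\n"
private_ok = allowed_by_archived_robots(
    sample_robots, "ExampleBot", "https://example.com/private/page"
)
public_ok = allowed_by_archived_robots(
    sample_robots, "ExampleBot", "https://example.com/public/page"
)
```

Keeping the first capture per digest preserves the earliest snapshot when the input is sorted by timestamp; a different policy (for example, keeping the latest) only requires changing the sort order before deduplicating.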

Best fit

When to reach for it

Best when the job fits Research & Scraping.
Works naturally with OpenClaw setups.

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
Last updated Mar 24, 2026.
