Scrapy Distributed Crawler Framework

Orchestrates large-scale web crawling using Scrapy, with scrapy-redis for distributed job queuing. Integrates Splash for JavaScript rendering, stores results in MongoDB via the scrapy-mongodb pipeline, obeys robots.txt, and adapts request rates with AutoThrottle.

Research & Scraping · MCP · Security Reviewed

Tool match: scrapy · ⭐ 61.3k GitHub stars · BSD-3-Clause license

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill scrapy-distributed-crawler-framework

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Author: scrapy

Last updated: Mar 24, 2026

Quick brief

The Scrapy Distributed Crawler Framework enables scalable web data collection across multiple crawler instances. Built on the Scrapy framework with the scrapy-redis extension, it keeps the URL frontier in shared Redis queues, so crawler workers can be scaled horizontally without processing the same URL twice.
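As a minimal sketch of the shared-frontier setup described above, the settings below route scheduling and request deduplication through Redis using scrapy-redis. The Redis URL is an assumption for a local instance, and the listing does not say which values this skill actually ships with.

```python
# settings.py (sketch): every worker started with these settings pulls from
# the same Redis-backed queue and shares one request fingerprint set.

# Replace Scrapy's in-memory scheduler with the Redis-backed one.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Keep request fingerprints in Redis so no worker re-crawls a seen URL.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue and fingerprint set between runs so crawls can resume.
SCHEDULER_PERSIST = True

# Assumed local Redis instance; point this at the shared server in practice.
REDIS_URL = "redis://localhost:6379/0"
```

With this in place, adding capacity is a matter of starting more worker processes against the same Redis server.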

How it works

What this skill actually does

JavaScript-rendered content is handled by Splash through the scrapy-splash middleware, a lightweight alternative to full browser automation for pages that require JS execution. For polite crawling, the framework relies on Scrapy's built-in robots.txt compliance, the AutoThrottle extension for adaptive request rates, and configurable per-domain delays.
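The rendering and politeness behaviour described above maps onto a handful of Scrapy settings. The following is a sketch only: the Splash URL and all numeric values are assumptions chosen for illustration, not values taken from this skill.

```python
# settings.py (sketch): Splash rendering plus polite-crawling controls.

# Assumed address of a locally running Splash instance.
SPLASH_URL = "http://localhost:8050"

# Middleware entries as documented in the scrapy-splash README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Honour robots.txt rules before fetching a page.
ROBOTSTXT_OBEY = True

# Let AutoThrottle adapt the delay to observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # illustrative value, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # illustrative value, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # illustrative value

# Baseline politeness: minimum delay and a cap on parallel requests per domain.
DOWNLOAD_DELAY = 0.5                   # illustrative value, in seconds
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # illustrative value
```

The scrapy-splash documentation also recommends its own Splash-aware dupefilter. How that should be combined with a Redis-backed dupefilter is not covered by this listing, so it is left out of the sketch.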

Items pass through configurable pipelines for validation, deduplication via MinHash fingerprinting, and storage in MongoDB through the scrapy-mongodb pipeline extension. Media files are downloaded by the built-in FilesPipeline, which supports S3 as a storage backend. For monitoring, Scrapy's stats collection is exported to Prometheus via a custom StatsD exporter, with Grafana dashboards visualizing crawl progress in real time. Spider contracts provide automated testing for extraction logic.
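To illustrate the MinHash deduplication step, here is a rough sketch of an item pipeline that drops items whose text closely matches one already seen. It is not the skill's actual pipeline: the `text` field, the constants, the class name, and the `myproject.pipelines` module path are all hypothetical, and the linear scan over stored signatures would need LSH banding or a shared store to work across many workers.

```python
import hashlib

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

NUM_HASHES = 64      # signature length (assumed)
SHINGLE_SIZE = 3     # words per shingle (assumed)
THRESHOLD = 0.8      # estimated Jaccard similarity treated as duplicate (assumed)


def shingles(text, k=SHINGLE_SIZE):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items))
    return tuple(signature)


def similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


class MinHashDedupPipeline:
    """Hypothetical pipeline: drop items that near-duplicate an earlier one."""

    def __init__(self):
        self.seen = []  # signatures of accepted items (in-memory, per process)

    def process_item(self, item, spider):
        sig = minhash_signature(item["text"])  # "text" is an assumed field
        if any(similarity(sig, prev) >= THRESHOLD for prev in self.seen):
            raise DropItem("near-duplicate item")
        self.seen.append(sig)
        return item


# In settings.py, dedup would run before storage (lower number runs first).
ITEM_PIPELINES = {
    "myproject.pipelines.MinHashDedupPipeline": 300,  # hypothetical path
    "scrapy_mongodb.MongoDBPipeline": 900,
}
MONGODB_URI = "mongodb://localhost:27017"  # assumed local MongoDB
MONGODB_DATABASE = "crawl"                 # assumed name
MONGODB_COLLECTION = "items"               # assumed name
```

Ordering the dedup pipeline ahead of the MongoDB pipeline means dropped items never reach storage.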

Best fit

When to reach for it

Best when the job is research or scraping at a scale that calls for multiple crawler workers.
Works naturally with MCP setups.

Trust & provenance

Why this listing is credible

Built around the scrapy toolchain.
Trust status: Security Reviewed.
61.3k GitHub stars on the linked upstream source.
License: BSD-3-Clause.
Last updated Mar 24, 2026.