Scrapy Pipeline Manager

Manages Scrapy spider deployments via the Scrapyd API, with custom item pipelines for MongoDB ingestion, deduplication via MinHash LSH, and rotating proxy middleware configuration.

Research & Scraping Claude Code Security Reviewed
Tool match: scrapy ⭐ 61.3k GitHub stars, BSD-3-Clause license
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill scrapy-pipeline-manager
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Author: scrapy
Last updated: Mar 24, 2026
Quick brief

The Scrapy Pipeline Manager skill orchestrates Scrapy spider deployments through the Scrapyd HTTP API. It handles egg packaging, project deployment, spider scheduling, and log retrieval across multiple Scrapyd nodes for distributed crawling.
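As a rough illustration of the scheduling step, the sketch below builds a request for Scrapyd's schedule.json endpoint and spreads spiders across nodes in round-robin order. It is a minimal sketch using only the standard library, not the skill's actual code: the function names, node URLs, and the round-robin policy are assumptions made for illustration. Scrapyd passes any extra form fields to the spider as arguments.

```python
import json
from itertools import cycle
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(node_url, project, spider, **spider_args):
    """Build a POST request for Scrapyd's schedule.json endpoint."""
    # Extra keyword arguments are forwarded to the spider as arguments.
    payload = {"project": project, "spider": spider, **spider_args}
    return Request(
        url=f"{node_url.rstrip('/')}/schedule.json",
        data=urlencode(payload).encode("utf-8"),
        method="POST",
    )


def schedule(node_url, project, spider, timeout=10, **spider_args):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(node_url, project, spider, **spider_args)
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def assign_round_robin(nodes, spiders):
    """Pair each spider with a node, cycling through the node list."""
    return list(zip(spiders, cycle(nodes)))
```

Calling schedule("http://localhost:6800", "demo", "news") against a running Scrapyd node should return a JSON reply containing a job id, which can then be used to look up the job's log on that node.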

How it works

What this skill actually does

Custom item pipelines are configured for downstream data processing, including MongoDB ingestion via PyMongo with automatic collection sharding, Elasticsearch indexing via the bulk API, and file download pipelines for media assets. Deduplication uses MinHash LSH (Locality-Sensitive Hashing) via the datasketch library for near-duplicate detection across crawl runs.
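To make the deduplication idea concrete, here is a small standard-library sketch of MinHash signatures with LSH banding. It is a stand-in for what the datasketch library provides (its MinHash and MinHashLSH classes), not the skill's implementation. The class name, shingle size, threshold, and band count are all assumptions, and the index lives in memory, so detecting duplicates across crawl runs would additionally require persisting it.

```python
import hashlib
from collections import defaultdict


def shingles(text, size=3):
    """Return the set of lowercase word n-grams in text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(text, num_perm=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return tuple(signature)


class NearDuplicateFilter:
    """Hypothetical helper: LSH banding over MinHash signatures."""

    def __init__(self, threshold=0.8, num_perm=64, bands=16):
        self.threshold = threshold
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)   # (band index, band values) -> keys
        self.signatures = {}              # key -> full signature

    def _band_keys(self, signature):
        for band in range(self.bands):
            start = band * self.rows
            yield band, signature[start:start + self.rows]

    def is_duplicate(self, key, text):
        """Return True if text is near a stored document; otherwise store it."""
        signature = minhash_signature(text, self.num_perm)
        # Candidates are documents sharing at least one identical band.
        candidates = set()
        for band_key in self._band_keys(signature):
            candidates |= self.buckets[band_key]
        # Confirm candidates with the estimated Jaccard similarity.
        for other in candidates:
            matches = sum(a == b for a, b in zip(signature, self.signatures[other]))
            if matches / self.num_perm >= self.threshold:
                return True
        self.signatures[key] = signature
        for band_key in self._band_keys(signature):
            self.buckets[band_key].add(key)
        return False
```

Inside a Scrapy item pipeline, process_item could call is_duplicate with the item's URL as the key and its body text as the input, and raise scrapy.exceptions.DropItem when it returns True.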

The middleware stack includes rotating proxy support via scrapy-rotating-proxies with dead proxy detection, custom retry middleware with exponential backoff, and AutoThrottle configuration for polite crawling. The skill manages robots.txt compliance, generates crawl statistics dashboards, and supports Splash integration for JavaScript rendering through the scrapy-splash middleware.
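A settings fragment along the following lines would enable the pieces described above. The setting names are those documented by Scrapy and scrapy-rotating-proxies, but every value, the proxy addresses, and the backoff_delay helper are illustrative assumptions rather than the skill's defaults. Scrapy's built-in retry middleware does not wait between attempts, so a custom retry middleware would have to apply a delay like this itself.

```python
# settings.py fragment (all values are illustrative assumptions)

# Respect robots.txt rules for every crawled domain.
ROBOTSTXT_OBEY = True

# AutoThrottle adapts the download delay to observed server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Retry transient failures a limited number of times.
RETRY_ENABLED = True
RETRY_TIMES = 4
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]

# scrapy-rotating-proxies: proxy pool plus ban detection.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",  # placeholder address
    "proxy2.example.com:8031",  # placeholder address
]
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}


def backoff_delay(retry_count, base=1.0, cap=60.0):
    """Hypothetical helper: exponential backoff in seconds, capped at cap."""
    return min(cap, base * (2 ** retry_count))
```

With these defaults the delay doubles on each attempt (1, 2, 4, 8 seconds) until it reaches the 60-second cap.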