Skill Detail

Scrapy Spider Data Pipeline

Builds and manages Scrapy web scraping spiders with custom item pipelines. Supports Splash rendering for JavaScript pages, rotating proxies via scrapy-rotating-proxies, and export to MongoDB or Elasticsearch.

Data Extraction & Transformation · Cursor · Security Reviewed

Tool match: scrapy · ⭐ 61.3k GitHub stars · BSD-3-Clause license

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill scrapy-spider-data-pipeline

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Author

scrapy

Last updated

Mar 24, 2026

Quick brief

The Scrapy Spider Data Pipeline skill creates and manages web scraping workflows built on the Scrapy framework. It generates spider classes that use CSS and XPath selectors, configures request scheduling and downloader middleware, and sets up item pipelines for data export.

How it works

What this skill actually does

JavaScript-rendered pages are handled through Splash via the scrapy-splash middleware, which sends Lua scripts to the Splash HTTP API to click, scroll, and wait for dynamic content. When a full headless browser is needed, scrapy-playwright provides Playwright-based browser automation.
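As a rough illustration of the Splash path, the sketch below combines the middleware settings documented in the scrapy-splash README with a helper that builds the `meta["splash"]` dictionary for a Lua script. The Splash address, the default wait time, and the names `SCROLL_AND_WAIT_LUA` and `splash_meta` are assumptions made for this example, not part of the skill.

```python
# settings.py fragment plus a request helper for scrapy-splash (illustrative sketch).

# Assumed address of a locally running Splash instance.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# Lua script for Splash's `execute` endpoint: load the page, scroll to the
# bottom, wait for dynamic content, then return the rendered HTML.
SCROLL_AND_WAIT_LUA = """
function main(splash, args)
  assert(splash:go(args.url))
  splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
  assert(splash:wait(args.wait))
  return {html = splash:html()}
end
"""


def splash_meta(wait_seconds: float = 2.0) -> dict:
    """Build request meta asking scrapy-splash to run the Lua script above."""
    return {
        "splash": {
            "endpoint": "execute",
            "args": {"lua_source": SCROLL_AND_WAIT_LUA, "wait": wait_seconds},
        }
    }
```

A spider could then yield `scrapy.Request(url, callback=self.parse, meta=splash_meta())`; scrapy-splash fills in the `url` argument from the request itself.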

Anti-blocking measures include proxy rotation through scrapy-rotating-proxies, which also tracks proxy health and detects bans; user agent rotation via scrapy-fake-useragent; and routing requests through the Crawlera smart proxy service (now Zyte Smart Proxy Manager) via scrapy-crawlera, which manages throttling and bans on the proxy side. Scrapy's retry middleware re-issues requests that fail with configured HTTP error codes or connection timeouts.
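A minimal settings sketch for the proxy rotation, user agent rotation, and retry pieces might look like the following. The middleware paths and priorities follow the scrapy-rotating-proxies and scrapy-fake-useragent READMEs; the proxy hosts, retry count, status codes, and timeout are placeholder assumptions to adjust per target site.

```python
# settings.py fragment for the anti-blocking stack (illustrative sketch).

# Hypothetical proxy endpoints; replace with real ones.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's static User-Agent and pick a random one per request.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
    # Choose a working proxy per request and mark banned proxies as dead.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}

# Scrapy's built-in RetryMiddleware: retry these statuses (and connection
# errors or timeouts) up to RETRY_TIMES extra attempts per request.
RETRY_ENABLED = True
RETRY_TIMES = 3
RETRY_HTTP_CODES = [408, 429, 500, 502, 503, 504]

# Give up on a single download after this many seconds (assumed value).
DOWNLOAD_TIMEOUT = 30
```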

Item pipelines support data validation with Scrapy ItemLoaders and input/output processors, deduplication via bloom filters, and export to multiple backends including MongoDB (pymongo), Elasticsearch (elasticsearch-py), PostgreSQL, and JSON Lines files. The skill configures Scrapy settings for concurrent requests, download delays, depth limits, and AutoThrottle for polite crawling. Feed exports support S3 and GCS cloud storage destinations.
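To make the deduplication and politeness settings more concrete, here is a small sketch of a Bloom-filter dedup pipeline plus a settings dictionary. Scrapy does not ship a Bloom filter, so the `BloomFilter` class is a hand-rolled stand-in; the `url` item field, the `myproject.pipelines` module path, and all numeric values are assumptions for illustration rather than what the skill necessarily generates.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 5) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key: str) -> Iterator[int]:
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class BloomDedupPipeline:
    """Item pipeline that drops items whose `url` field was probably seen."""

    def __init__(self) -> None:
        self.seen = BloomFilter()

    def process_item(self, item, spider):
        key = str(item.get("url"))  # assumes dict-like items with a `url` field
        if key in self.seen:
            # Imported lazily so the filter is usable without Scrapy installed.
            from scrapy.exceptions import DropItem

            raise DropItem(f"Duplicate item: {key}")
        self.seen.add(key)
        return item


# Could be copied into settings.py or used as a spider's `custom_settings`.
CRAWL_SETTINGS = {
    "ITEM_PIPELINES": {"myproject.pipelines.BloomDedupPipeline": 300},
    "CONCURRENT_REQUESTS": 8,
    "DOWNLOAD_DELAY": 0.5,
    "DEPTH_LIMIT": 3,
    "AUTOTHROTTLE_ENABLED": True,
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
    # JSON Lines feed export to a local file.
    "FEEDS": {"items.jsonl": {"format": "jsonlines"}},
}
```

Because a Bloom filter can report false positives, this design trades a small chance of dropping a genuinely new item for constant memory use; an exact `set` is the simpler choice when the crawl is small enough to hold every key.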

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Cursor setups.

Trust & provenance

Why this listing is credible

Built around the scrapy toolchain.
Trust status: Security Reviewed.
61.3k GitHub stars on the linked upstream source.
License: BSD-3-Clause.
Last updated Mar 24, 2026.
