Skill Detail

Scrapy Pipeline Data Extractor

Builds production Scrapy spiders with custom Item Pipelines for data cleaning and storage. Uses scrapy.linkextractors.LinkExtractor for crawl scoping and ItemLoader with MapCompose processors for field normalization.

Research & ScrapingGemini
Research & Scraping Gemini Security Reviewed
Tool match: scrapy โญ 61.3k GitHub stars BSD-3-Clause license
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill scrapy-pipeline-data-extractor Copy
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Author
scrapy
Last updated
Mar 24, 2026
Quick brief

This skill creates production-ready web scraping pipelines using the Scrapy framework. It builds spiders with proper crawl scoping using scrapy.linkextractors.LinkExtractor with allow/deny regex patterns, and handles pagination through CrawlSpider rules or manual next-page following.

How it works

What this skill actually does

Data extraction uses ItemLoader with input/output processors like MapCompose and TakeFirst for field normalization, handling common tasks like whitespace stripping, price parsing, date normalization, and HTML tag removal. Custom Item Pipeline classes handle data validation, deduplication via fingerprinting, and storage to multiple backends including PostgreSQL, MongoDB, and JSON Lines.

The skill handles anti-scraping measures including rotating User-Agent headers via scrapy-fake-useragent, request throttling with AutoThrottle middleware, and proxy rotation. It supports JavaScript-rendered pages via scrapy-playwright integration for SPAs that require browser rendering.