Scrapy Pipeline Data Extractor

Builds production Scrapy spiders with custom Item Pipelines for data cleaning and storage. Uses scrapy.linkextractors.LinkExtractor for crawl scoping and ItemLoader with MapCompose processors for field normalization.

Research & Scraping, Gemini, Security Reviewed

Tool match: scrapy (61.3k GitHub stars, BSD-3-Clause license)

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill scrapy-pipeline-data-extractor

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Author

scrapy

Last updated

Mar 24, 2026

Quick brief

This skill creates production-ready web scraping pipelines using the Scrapy framework. It builds spiders with proper crawl scoping using scrapy.linkextractors.LinkExtractor with allow/deny regex patterns, and handles pagination through CrawlSpider rules or manual next-page following.

How it works

What this skill actually does

Data extraction uses ItemLoader with input/output processors like MapCompose and TakeFirst for field normalization, handling common tasks like whitespace stripping, price parsing, date normalization, and HTML tag removal. Custom Item Pipeline classes handle data validation, deduplication via fingerprinting, and storage to multiple backends including PostgreSQL, MongoDB, and JSON Lines.

The skill handles anti-scraping measures including rotating User-Agent headers via scrapy-fake-useragent, request throttling with Scrapy's built-in AutoThrottle extension, and proxy rotation. It supports JavaScript-rendered pages via scrapy-playwright integration for SPAs that require browser rendering.
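
The settings fragment below sketches how these pieces are commonly wired together in a project's settings.py. The numeric values are illustrative assumptions, and the middleware and handler paths follow the scrapy-fake-useragent and scrapy-playwright READMEs as best recalled, so they should be checked against the installed versions. Proxy rotation is not shown.

```python
# settings.py fragment: a sketch, not the skill's actual configuration.

# AutoThrottle adapts the download delay to the latency it observes.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds (assumed value)
AUTOTHROTTLE_MAX_DELAY = 30.0           # upper bound under high latency (assumed value)
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote site

# Replace the stock User-Agent and retry middlewares with scrapy-fake-useragent's.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": None,
    "scrapy_fake_useragent.middleware.RandomUserAgentMiddleware": 400,
    "scrapy_fake_useragent.middleware.RetryUserAgentMiddleware": 401,
}

# Route requests through scrapy-playwright, which needs the asyncio reactor.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```

With scrapy-playwright configured this way, only requests that set meta={"playwright": True} are rendered in a browser; all others use the normal downloader. A per-request proxy can be set through meta["proxy"], which Scrapy's built-in HttpProxyMiddleware honors.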

Best fit

When to reach for it

Best when the job fits Research & Scraping.
Works naturally with Gemini setups.

Trust & provenance

Why this listing is credible

Built around the scrapy toolchain.
Trust status: Security Reviewed.
61.3k GitHub stars on the linked upstream source.
License: BSD-3-Clause.
Last updated Mar 24, 2026.