Data Extraction & Transformation
Published
Use pdf-mcp to inspect a PDF, search it, and load only the pages that matter so an agent can answer questions from long documents without brute-forcing the whole file into context.
MCP Data Extraction & Transformation
Security Reviewed
Use DeepDiff to compare structured objects deeply and get precise additions, removals, value changes, and deltas instead of noisy line-based diffs. It fits best when an agent is validating API payloads, configuration snapshots, or migration outputs where nesting and key paths matter.
Multi-Framework Data Extraction & Transformation
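The idea behind a deep, path-aware comparison can be sketched with the standard library alone. This is a minimal illustration, not DeepDiff's implementation or its output format: the function name `deep_diff`, the `root[...]` path style, and the three report buckets are assumptions chosen for the example.

```python
from typing import Any


def _merge(target: dict, source: dict) -> None:
    """Fold a child report into its parent, bucket by bucket."""
    for kind, entries in source.items():
        target[kind].update(entries)


def deep_diff(old: Any, new: Any, path: str = "root") -> dict:
    """Compare two nested structures and group differences by key path.

    Hypothetical helper for illustration; DeepDiff itself reports
    finer-grained categories and handles many more types.
    """
    report = {"added": {}, "removed": {}, "changed": {}}

    if isinstance(old, dict) and isinstance(new, dict):
        for key in old.keys() | new.keys():
            child = f"{path}[{key!r}]"
            if key not in old:
                report["added"][child] = new[key]
            elif key not in new:
                report["removed"][child] = old[key]
            else:
                _merge(report, deep_diff(old[key], new[key], child))
    elif isinstance(old, list) and isinstance(new, list):
        # Lists are compared position by position in this sketch.
        shared = min(len(old), len(new))
        for i in range(shared):
            _merge(report, deep_diff(old[i], new[i], f"{path}[{i}]"))
        for i in range(shared, len(new)):
            report["added"][f"{path}[{i}]"] = new[i]
        for i in range(shared, len(old)):
            report["removed"][f"{path}[{i}]"] = old[i]
    elif old != new:
        report["changed"][path] = {"old": old, "new": new}

    return report
```

Calling `deep_diff(old_config, new_config)` on two configuration snapshots returns only the paths that differ, which is the property that makes this kind of diff easier to act on than a line-based one.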
Security Reviewed
Use jc to turn command output and supported file formats into structured JSON so an agent can filter, diff, validate, and store results without brittle regex parsing. It fits best when a workflow already depends on standard CLI tools but needs machine-readable output for the next step.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Use csv-diff when an agent needs to explain what changed between two structured exports, not just that the files differ. The agent lines records up by a stable key, reports added, removed, and changed rows, and can hand the result to humans or downstream automations as readable text or machine-friendly JSON.
Multi-Framework Data Extraction & Transformation
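The keyed comparison described for csv-diff (line records up by a stable key, then report added, removed, and changed rows) can be sketched with Python's `csv` module. This is an illustration of the technique only, not csv-diff's code or its exact output schema; `load_by_key` and `diff_rows` are hypothetical names, and the sketch assumes the key column is unique within each file.

```python
import csv
import io


def load_by_key(csv_text: str, key: str) -> dict:
    """Index the rows of a CSV document by the value in the key column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_rows(old_text: str, new_text: str, key: str) -> dict:
    """Report added, removed, and changed rows between two CSV exports."""
    old = load_by_key(old_text, key)
    new = load_by_key(new_text, key)

    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]

    changed = []
    for k in old:
        if k in new and old[k] != new[k]:
            columns = sorted(old[k].keys() | new[k].keys())
            # Record [old_value, new_value] for each column that differs.
            changes = {
                c: [old[k].get(c), new[k].get(c)]
                for c in columns
                if old[k].get(c) != new[k].get(c)
            }
            changed.append({"key": k, "changes": changes})

    return {"added": added, "removed": removed, "changed": changed}
```

The returned dictionary contains only strings, lists, and dictionaries, so it can be passed to `json.dumps` for a downstream automation or formatted as text for a human reader.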
Security Reviewed
Use html-to-text when an agent receives raw HTML from inboxes, support systems, or scraped pages and needs readable plain text before classification, summarization, or indexing. The skill is deliberately bounded to deterministic HTML-to-text conversion, not crawling or summarization.
Multi-Framework Data Extraction & Transformation
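Deterministic HTML-to-text conversion of the kind described above can be illustrated with Python's built-in `html.parser`. This sketch is not the html-to-text library (which is a separate package with its own options); the tag sets and the `html_to_text` function name are assumptions made for the example.

```python
from html.parser import HTMLParser

# Tags treated as line boundaries, and tags whose content is dropped.
# Both sets are illustrative choices, not taken from any library.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, inserting newlines around block-level tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Convert an HTML string to plain text with one line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    # Collapse runs of whitespace and drop lines that end up empty.
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)
```

The same input always produces the same output, with no network access and no model call, which is what keeps a step like this safe to place ahead of classification, summarization, or indexing.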
Security Reviewed
Use the Anthropic xlsx skill when an agent needs to create, clean up, or modify .xlsx, .xlsm, .csv, or .tsv files as spreadsheet deliverables, not just inspect tabular data. It pushes the agent toward formula-safe edits, workbook validation, and recalculation instead of hardcoded outputs or one-off scripts.
Claude Agents Data Extraction & Transformation
Security Reviewed
Use invoice2data to turn invoice PDFs into structured JSON, CSV, or XML using supplier-specific templates. This is for repeatable invoice field extraction and renaming workflows, not for full accounting system automation or general-purpose OCR.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Use Mammoth when an agent needs to turn a .docx file into simple HTML that preserves semantic structure instead of Word-specific styling. This is for ingestion and publishing workflows, not for full document editing or perfect visual fidelity.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Unstructured is an open source document ETL toolkit for converting PDFs, HTML, emails, and office files into structured data. This skill covers how to use the Unstructured project for partitioning documents, normalizing content, and feeding downstream agent or RAG pipelines.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Paperless-ngx is an open source document management system that turns scanned or uploaded paperwork into a searchable archive. It combines OCR-driven ingestion, indexing, tagging, storage, and retrieval for teams that need structured access to documents.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Unstructured is an open source document processing library that converts PDFs, HTML, Office files, emails, and other formats into structured data for downstream AI workflows. It is a practical intake layer for extraction, chunking, and preprocessing before embeddings, search, or agent use.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Docling is an open source document processing toolkit that converts PDFs, Office files, HTML, images, audio, and more into structured outputs for AI workflows. It supports local execution, OCR, and integrations with agent frameworks and retrieval pipelines.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Docling is an open source document processing toolkit that converts PDFs, Office files, HTML, and other formats into structured output for downstream AI and automation workflows. It is well documented, actively maintained, and published as a Python package with a live docs site.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Metabase is an open source business intelligence platform for querying data, building dashboards, and embedding analytics. It gives agents a real analytics surface for answering operational questions, creating dashboards, and wiring self-service reporting to databases or warehouse backends.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Docling is an open-source document processing toolkit for turning PDFs and other files into structured outputs for AI systems. It handles advanced PDF understanding, OCR, multiple export formats, and integrations with agent and retrieval frameworks.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Apache Superset is a widely adopted open-source BI platform for SQL exploration, chart building, and dashboard delivery. This skill is useful when an agent needs to query warehouse data, assemble dashboards, or explain metrics using a mature analytics interface instead of ad hoc notebook code.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Cheerio is a long-running Node.js library for parsing and manipulating HTML and XML with a jQuery-like API. It is widely used in scraping, extraction, and content transformation pipelines where developers need fast server-side DOM traversal without a browser runtime.
Multi-Framework Data Extraction & Transformation
Security Reviewed
LightRAG is a Python-based retrieval-augmented generation framework that builds knowledge graphs from documents for more connected, contextual retrieval. Published at EMNLP 2025, it enables graph-powered RAG with support for multiple storage backends and LLM providers.
Multi-Framework Data Extraction & Transformation
Security Reviewed
Jina Reader converts any URL to LLM-friendly markdown by prefixing https://r.jina.ai/ to any web address. It also provides a search endpoint at https://s.jina.ai/ that returns web search results in clean markdown format for RAG and agent workflows.
Multi-Framework Data Extraction & Transformation
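The URL-prefix convention described for Jina Reader can be sketched in a few lines of standard-library Python. The helper names `reader_url` and `fetch_markdown` are hypothetical, and details such as rate limits, authentication headers, and the exact response layout are not covered by the description above, so treat this as a starting point under those assumptions.

```python
from urllib.request import urlopen

# Reader base address as given in the listing above.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target: str) -> str:
    """Build the Reader address for a page by prefixing the base URL."""
    if not target.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL")
    return READER_PREFIX + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch the markdown rendering of a page.

    Performs a real network request; assumes the service answers
    with UTF-8 text and needs no extra headers.
    """
    with urlopen(reader_url(target), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping URL construction separate from the network call lets an agent log or validate the final address before any request is made.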
Security Reviewed
pdfplumber is a Python library for extracting detailed information from PDFs — text, tables, lines, rectangles, and curves — with visual debugging support. Built on pdfminer.six, it excels at structured table extraction from machine-generated PDFs and includes both a Python API and CLI.
Custom Agents Data Extraction & Transformation
Security Reviewed
WeasyPrint is a Python library by Kozea/CourtBouillon that converts HTML and CSS into PDF documents. It implements a CSS layout engine designed specifically for pagination, supporting web standards for printing including page breaks, headers, page counters, and responsive layouts without relying on a browser engine like WebKit or Gecko.
Custom Agents Data Extraction & Transformation
Security Reviewed
Gorse is an AI-powered open-source recommender system written in Go that generates personalized recommendations via collaborative filtering, item-to-item similarity, and LLM-based ranking. It provides RESTful APIs and a GUI dashboard for recommendation pipeline editing, system monitoring, and data management.
Custom Agents Data Extraction & Transformation
Security Reviewed
rehype is a plugin-based HTML processing toolkit built on the unified ecosystem. It parses HTML into an abstract syntax tree, transforms it with composable plugins, and serializes it back — enabling programmatic HTML minification, sanitization, link rewriting, heading extraction, and content manipulation at scale.
Multi-Framework Data Extraction & Transformation
Security Reviewed
markdownify is a Python library that converts HTML content to clean Markdown text. It supports tag filtering, heading styles, custom converters, and code language detection, which makes it a good fit for content extraction and document transformation pipelines.
Multi-Framework Data Extraction & Transformation