Marketplace category archive

Data Extraction & Transformation Skills

Explore live Data Extraction & Transformation skills across the current marketplace catalog.

141 live listings
10 frameworks in use
Live taxonomy archive

Category Skills

Browse the published marketplace skills currently assigned to this category.

Data Extraction & Transformation Security Reviewed

Diff nested JSON, API responses, and config snapshots before approving changes

Uses DeepDiff to compare structured objects deeply and return precise additions, removals, value changes, and deltas instead of noisy line-based diffs. Best when an agent is validating API payloads, configuration snapshots, or migration outputs where nesting and key paths matter.

Multi-Framework Data Extraction & Transformation
Yesterday 👁 1 View skill →
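
To illustrate the kind of path-keyed result the DeepDiff listing above describes, here is a minimal standard-library sketch of a recursive comparison. It is not DeepDiff's API or implementation; the function name `deep_diff`, the `root[...]` path format, and the three result buckets are assumptions chosen for this example.

```python
from typing import Any


def deep_diff(old: Any, new: Any) -> dict[str, list]:
    """Compare two JSON-like values and report changes by key path.

    Hypothetical helper for illustration only, not part of DeepDiff.
    """
    changes: dict[str, list] = {"added": [], "removed": [], "changed": []}

    def walk(a: Any, b: Any, path: str) -> None:
        if isinstance(a, dict) and isinstance(b, dict):
            # Visit the union of keys so additions and removals both show up.
            for key in sorted(a.keys() | b.keys(), key=str):
                child = f"{path}[{key!r}]"
                if key not in a:
                    changes["added"].append((child, b[key]))
                elif key not in b:
                    changes["removed"].append((child, a[key]))
                else:
                    walk(a[key], b[key], child)
        elif isinstance(a, list) and isinstance(b, list):
            # Lists are compared position by position in this sketch.
            for i in range(max(len(a), len(b))):
                child = f"{path}[{i}]"
                if i >= len(a):
                    changes["added"].append((child, b[i]))
                elif i >= len(b):
                    changes["removed"].append((child, a[i]))
                else:
                    walk(a[i], b[i], child)
        elif a != b:
            changes["changed"].append((path, a, b))

    walk(old, new, "root")
    return changes


old_config = {"service": {"port": 8080, "tags": ["a", "b"]}, "debug": False}
new_config = {"service": {"port": 9090, "tags": ["a"]}, "region": "eu"}
result = deep_diff(old_config, new_config)
```

Reporting a key path with each change is what lets an agent approve or reject one nested field at a time, which a line-based diff of serialized JSON cannot do reliably.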
Data Extraction & Transformation Security Reviewed

Normalize raw CLI output into JSON for reliable downstream parsing and automation

Uses jc to turn command output and supported file formats into structured JSON so an agent can filter, diff, validate, and store results without brittle regex parsing. Best when a workflow already depends on standard CLI tools but needs machine-readable output for the next step.

Multi-Framework Data Extraction & Transformation
Yesterday 👁 13 View skill →
Data Extraction & Transformation Security Reviewed

Compare recurring CSV, TSV, or JSON exports and emit row-level change sets before syncs

Use csv-diff when an agent needs to explain what changed between two structured exports, not just that the files differ. The agent lines records up by a stable key, reports added, removed, and changed rows, and can hand the result to humans or downstream automations as readable text or machine-friendly JSON.

Multi-Framework Data Extraction & Transformation
Yesterday 👁 2 View skill →
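
The key-based comparison described in the csv-diff listing above can be sketched with the standard library alone. This is an illustration of the idea (line records up by a stable key, then report added, removed, and changed rows), not csv-diff's own code or command-line interface; `diff_rows` and the sample data are hypothetical.

```python
import csv
import io
import json


def diff_rows(old_csv: str, new_csv: str, key: str) -> dict:
    """Return added, removed, and changed rows, matching records on `key`."""
    old = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}

    added = [new[k] for k in new if k not in old]
    removed = [old[k] for k in old if k not in new]

    changed = []
    for k in old:
        if k in new and old[k] != new[k]:
            # Record only the columns whose values differ, as [before, after].
            fields = {
                col: [old[k].get(col), new[k].get(col)]
                for col in sorted(old[k].keys() | new[k].keys())
                if old[k].get(col) != new[k].get(col)
            }
            changed.append({"key": k, "changes": fields})

    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample exports, keyed on the "id" column.
OLD_EXPORT = "id,name,price\n1,apple,1.00\n2,pear,2.50\n3,plum,0.75\n"
NEW_EXPORT = "id,name,price\n1,apple,1.20\n3,plum,0.75\n4,kiwi,3.00\n"

report = diff_rows(OLD_EXPORT, NEW_EXPORT, key="id")
print(json.dumps(report, indent=2))  # machine-friendly JSON for downstream steps
```

The key column must be stable across exports. If ids are regenerated on each export, every row will appear as one removal plus one addition.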
Data Extraction & Transformation Security Reviewed

Convert HTML emails and web fragments into clean plain text for downstream agents

Use html-to-text when an agent receives raw HTML from inboxes, support systems, or scraped pages and needs readable plain text before classification, summarization, or indexing. The skill is deliberately bounded to deterministic HTML-to-text conversion, not crawling or summarization.

Multi-Framework Data Extraction & Transformation
2 days ago 👁 1 View skill →
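
As a rough picture of the deterministic HTML-to-text conversion the html-to-text listing above describes, the sketch below uses Python's built-in `html.parser`. It stands in for the idea only and does not reproduce the html-to-text package or its options; the class name, the tag sets, and the whitespace rules are assumptions made for this example.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for this sketch: block tags start a new line,
# and the contents of skipped tags are dropped entirely.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect visible text, inserting line breaks around block-level tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_plain_text(html: str) -> str:
    """Convert an HTML fragment to plain text with one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    # Collapse runs of spaces and tabs, then drop empty lines.
    lines = [re.sub(r"[ \t\r\f\v]+", " ", line).strip() for line in text.split("\n")]
    return "\n".join(line for line in lines if line)


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Order update</h1>"
    "<p>Your order <b>#123</b> from Tom &amp; Co. has   shipped.</p>"
    "<script>track()</script>"
    "<ul><li>Item A</li><li>Item B</li></ul>"
    "</body></html>"
)
plain = html_to_plain_text(sample)
```

The same input always yields the same output, so the result is safe to pass to classification or indexing steps that need to be reproducible.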
Data Extraction & Transformation Security Reviewed

Create, repair, and recalculate spreadsheet workbooks without breaking formulas

Use the Anthropic xlsx skill when an agent needs to create, clean up, or modify .xlsx, .xlsm, .csv, or .tsv files as spreadsheet deliverables, not just inspect tabular data. It pushes the agent toward formula-safe edits, workbook validation, and recalculation instead of hardcoded outputs or one-off scripts.

Claude Agents Data Extraction & Transformation
2 days ago 👁 1 View skill →
Data Extraction & Transformation Security Reviewed

Extract invoice fields from vendor PDFs into structured records

Uses invoice2data to turn invoice PDFs into structured JSON, CSV, or XML using supplier-specific templates. This is for repeatable invoice field extraction and renaming workflows, not for full accounting system automation or generic OCR.

Multi-Framework Data Extraction & Transformation
2 days ago 👁 3 View skill →
Data Extraction & Transformation Security Reviewed

Convert DOCX documents into clean HTML for publishing workflows with Mammoth

Use Mammoth when an agent needs to turn a .docx file into simple HTML that preserves semantic structure instead of Word-specific styling. This is for ingestion and publishing workflows, not for full document editing or perfect visual fidelity.

Multi-Framework Data Extraction & Transformation
3 days ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Unstructured Document ETL Toolkit

Unstructured is an open source document ETL toolkit for converting PDFs, HTML, emails, and office files into structured data. This skill covers how to use the real Unstructured project for partitioning documents, normalizing content, and feeding downstream agent or RAG pipelines.

Multi-Framework Data Extraction & Transformation
4 days ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Paperless-ngx Document OCR and Archive Management System

Paperless-ngx is an open source document management system that turns scanned or uploaded paperwork into a searchable archive. It combines OCR-driven ingestion, indexing, tagging, storage, and retrieval for teams that need structured access to documents.

Multi-Framework Data Extraction & Transformation
5 days ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Unstructured Document ETL for LLM Pipelines

Unstructured is an open source document processing library that converts PDFs, HTML, Office files, emails, and other formats into structured data for downstream AI workflows. It is a practical intake layer for extraction, chunking, and preprocessing before embeddings, search, or agent use.

Multi-Framework Data Extraction & Transformation
6 days ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Docling Document Parsing and Conversion Toolkit

Docling is an open source document processing toolkit that converts PDFs, Office files, HTML, images, audio, and more into structured outputs for AI workflows. It supports local execution, OCR, and integrations with agent frameworks and retrieval pipelines.

Multi-Framework Data Extraction & Transformation
6 days ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Docling Document Conversion and Extraction Toolkit

Docling is an open source document processing toolkit from the Docling project that converts PDFs, Office files, HTML, and other formats into structured output for downstream AI and automation workflows. It is well documented, actively maintained, and published as a Python package with a live docs site.

Multi-Framework Data Extraction & Transformation
6 days ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Metabase Open Source Business Intelligence and Embedded Analytics

Metabase is an open source business intelligence platform for querying data, building dashboards, and embedding analytics. It gives agents a real analytics surface for answering operational questions, creating dashboards, and wiring self-service reporting to databases or warehouse backends.

Multi-Framework Data Extraction & Transformation
6 days ago 👁 5 View skill →
Data Extraction & Transformation Security Reviewed

Docling Document Parsing and Conversion

Docling is an open-source document processing toolkit for turning PDFs and other files into structured outputs for AI systems. It handles advanced PDF understanding, OCR, multiple export formats, and integrations with agent and retrieval frameworks.

Multi-Framework Data Extraction & Transformation
1w ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Apache Superset Dashboard and SQL Exploration Skill

Apache Superset is a widely adopted open-source BI platform for SQL exploration, chart building, and dashboard delivery. This skill is useful when an agent needs to query warehouse data, assemble dashboards, or explain metrics using a mature analytics interface instead of ad hoc notebook code.

Multi-Framework Data Extraction & Transformation
1w ago 👁 3 View skill →
Data Extraction & Transformation Security Reviewed

Cheerio HTML and XML Parsing Library for Node.js Extraction Workflows

Cheerio is a long-running Node.js library for parsing and manipulating HTML and XML with a jQuery-like API. It is widely used in scraping, extraction, and content transformation pipelines where developers need fast server-side DOM traversal without a browser runtime.

Multi-Framework Data Extraction & Transformation
1w ago 👁 1 View skill →
Data Extraction & Transformation Security Reviewed

LightRAG Graph-Based Retrieval-Augmented Generation Framework

LightRAG is a Python-based retrieval-augmented generation framework that builds knowledge graphs from documents for more connected, contextual retrieval. Published at EMNLP 2025, it enables graph-powered RAG with support for multiple storage backends and LLM providers.

Multi-Framework Data Extraction & Transformation
1w ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

Jina Reader URL-to-Markdown Converter and Web Search API

Jina Reader converts any URL to LLM-friendly markdown by prefixing https://r.jina.ai/ to any web address. It also provides a search endpoint at https://s.jina.ai/ that returns web search results in clean markdown format for RAG and agent workflows.

Multi-Framework Data Extraction & Transformation
1w ago 👁 4 View skill →
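
The URL-prefixing convention in the Jina Reader listing above can be sketched as follows. Only the `https://r.jina.ai/` prefix comes from the listing; the helper names, the scheme check, the User-Agent string, and the timeout are assumptions, and the search endpoint is left out because the listing does not say how queries are passed to it.

```python
from urllib.parse import urlsplit
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"  # prefix quoted in the listing


def reader_url(target: str) -> str:
    """Build the reader URL by prefixing the full target address."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    return READER_PREFIX + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch the Markdown rendering of a page (performs a network request).

    Hypothetical helper: authentication and rate limits are not covered here.
    """
    request = Request(reader_url(target), headers={"User-Agent": "example-agent/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


example = reader_url("https://example.com/docs?page=2")
```

Keeping URL construction separate from fetching lets the prefix logic be checked without network access.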
Data Extraction & Transformation Security Reviewed

pdfplumber Python PDF Text and Table Extraction Library

pdfplumber is a Python library for extracting detailed information from PDFs — text, tables, lines, rectangles, and curves — with visual debugging support. Built on pdfminer.six, it excels at structured table extraction from machine-generated PDFs and includes both a Python API and CLI.

Custom Agents Data Extraction & Transformation
1w ago 👁 3 View skill →
Data Extraction & Transformation Security Reviewed

WeasyPrint HTML and CSS to PDF Document Generator

WeasyPrint is a Python library by Kozea/CourtBouillon that converts HTML and CSS into PDF documents. It implements a CSS layout engine designed specifically for pagination, supporting web standards for printing including page breaks, headers, page counters, and responsive layouts without relying on a browser engine like WebKit or Gecko.

Custom Agents Data Extraction & Transformation
1w ago 👁 3 View skill →
Data Extraction & Transformation Security Reviewed

Gorse AI-Powered Open Source Recommender System Engine

Gorse is an AI-powered open-source recommender system written in Go that generates personalized recommendations via collaborative filtering, item-to-item similarity, and LLM-based ranking. It provides RESTful APIs and a GUI dashboard for recommendation pipeline editing, system monitoring, and data management.

Custom Agents Data Extraction & Transformation
1w ago 👁 1 View skill →
Data Extraction & Transformation Security Reviewed

rehype Plugin-Based HTML Processor by the Unified Collective

rehype is a plugin-based HTML processing toolkit built on the unified ecosystem. It parses HTML into an abstract syntax tree, transforms it with composable plugins, and serializes it back — enabling programmatic HTML minification, sanitization, link rewriting, heading extraction, and content manipulation at scale.

Multi-Framework Data Extraction & Transformation
1w ago 👁 2 View skill →
Data Extraction & Transformation Security Reviewed

markdownify Python HTML to Markdown Conversion Library

markdownify is a Python library that converts HTML content to clean Markdown text. It supports tag filtering, heading styles, custom converters, and code language detection, making it useful for content extraction and document transformation pipelines.

Multi-Framework Data Extraction & Transformation
1w ago 👁 3 View skill →