Marketplace category archive

Data Extraction & Transformation Skills

Explore live Data Extraction & Transformation skills across the current marketplace catalog.

142 live listings
10 frameworks in use
Live taxonomy archive

Category Skills

Browse the published marketplace skills currently assigned to this category.

Data Extraction & Transformation Security Reviewed

markdownify Python HTML to Markdown Conversion Library

markdownify is a Python library that converts HTML content to clean Markdown text. It supports tag filtering, heading styles, custom converters, and code language detection, which makes it a good fit for content extraction and document transformation pipelines.

Multi-Framework Data Extraction & Transformation
1w ago 👁 3
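To make the HTML-to-Markdown idea concrete, here is a minimal sketch built only on Python's standard `html.parser`. It is not markdownify's implementation or API: the names `TinyMarkdownConverter` and `to_markdown`, and the `strip` parameter, are hypothetical and only illustrate tag filtering and ATX-style headings.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, minimal HTML-to-Markdown converter (illustration only)."""

    def __init__(self, strip=()):
        super().__init__()
        self.strip = set(strip)  # tags whose markup is dropped while text is kept
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.strip:
            return
        if tag in ("h1", "h2", "h3"):
            # ATX-style heading: one '#' per heading level
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag in ("b", "strong"):
            self.parts.append("**")
        elif tag in ("i", "em"):
            self.parts.append("*")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.strip:
            return
        if tag in ("h1", "h2", "h3", "p"):
            self.parts.append("\n\n")
        elif tag in ("b", "strong"):
            self.parts.append("**")
        elif tag in ("i", "em"):
            self.parts.append("*")
        elif tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def to_markdown(html, strip=()):
    """Convert a small HTML fragment to Markdown text."""
    parser = TinyMarkdownConverter(strip=strip)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()


sample = "<h2>Title</h2><p>Some <b>bold</b> and <a href='https://example.com'>a link</a>.</p>"
print(to_markdown(sample))
print(to_markdown(sample, strip=("a",)))  # same text, link markup filtered out
```

A real converter also has to handle lists, tables, code blocks, and escaping, which is where a maintained library earns its place.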
Data Extraction & Transformation Security Reviewed

xan SIMD-Powered CSV Processing and Analysis CLI

xan is a high-performance command-line tool for processing CSV files, written in Rust with a novel SIMD CSV parser. It offers filtering, slicing, aggregation, sorting, joining, and visualization of CSV data, with its own expression language for complex transformations and support for adjacent data formats.

Multi-Framework Data Extraction & Transformation
1w ago 👁 3
Data Extraction & Transformation Security Reviewed

Orama Embeddable Search Engine and RAG Pipeline for JavaScript

Orama is a full-text, vector, and hybrid search engine, under 2KB in size, that runs in the browser, on a server, or at the edge. It provides built-in RAG pipeline support, typo tolerance, faceted search, and language-agnostic stemming, all without external dependencies.

Multi-Framework Data Extraction & Transformation
1w ago 👁 3
Data Extraction & Transformation Security Reviewed

Gitingest Repository-to-Prompt Codebase Extraction Tool

Gitingest turns a Git repository into a prompt-friendly text bundle that agents and LLM workflows can inspect quickly. It can be used as a hosted URL pattern, a Python package, or a local server for extracting repository summaries, structure, and source content.

Multi-Framework Data Extraction & Transformation
1w ago 👁 5
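The repository-to-prompt idea can be sketched with the standard library alone. The function below is a hypothetical illustration, not Gitingest's code or output format: `bundle_directory`, its `suffixes` filter, and the `===` separators are all assumptions made for this example.

```python
from pathlib import Path


def bundle_directory(root, suffixes=(".py", ".md")):
    """Concatenate matching files under `root` into one prompt-friendly string."""
    root = Path(root)
    files = sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix in suffixes
    )
    # Start with a flat listing so a reader (or model) sees the structure first.
    listing = "\n".join(p.relative_to(root).as_posix() for p in files)
    sections = [f"Files:\n{listing}"]
    # Then append each file's content under a labeled separator.
    for p in files:
        rel = p.relative_to(root).as_posix()
        sections.append(f"=== {rel} ===\n{p.read_text(encoding='utf-8')}")
    return "\n\n".join(sections)
```

Calling `bundle_directory(".")` from a checkout returns a single string that can be pasted into a prompt. A production tool would also respect ignore files, skip binaries, and cap total size.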
Data Extraction & Transformation Security Reviewed

Unstructured Document Partitioning and ETL Library for LLM Pipelines

Unstructured is an open-source library for ingesting and partitioning PDFs, HTML, Office documents, emails, and other unstructured inputs into structured elements and metadata. It is commonly used as a preprocessing layer for RAG, search, extraction, and downstream AI pipelines.

⭐ 14.4k unstructured Apache-2.0
Multi-Framework Data Extraction & Transformation
1w ago 👁 8
Data Extraction & Transformation Security Reviewed

Teable No-Code Postgres Database Platform and Airtable Alternative

Teable is an open-source no-code database platform built on PostgreSQL that provides a spreadsheet-like interface for building database applications. It supports real-time collaboration, scales to millions of rows, and provides a REST API for programmatic access.

⭐ 21.1k teable NOASSERTION
Multi-Framework Data Extraction & Transformation
2w ago 👁 4
Data Extraction & Transformation Security Reviewed

MarkItDown Document-to-Markdown Converter by Microsoft

MarkItDown is a Python utility by Microsoft that converts PDF, Word, PowerPoint, Excel, images, audio, HTML, and other files into Markdown for LLM consumption. It preserves headings, lists, tables, and links while producing token-efficient output optimized for text analysis pipelines.

⭐ 93.2k markitdown MIT
Multi-Framework Data Extraction & Transformation
2w ago 👁 6
Data Extraction & Transformation Security Reviewed

Grist Self-Hosted Relational Spreadsheet and Database Platform

Grist is an open-source modern relational spreadsheet that combines the flexibility of a spreadsheet with the robustness of a database. It supports Python formulas, a REST API, self-hosting via Docker, and AI-powered formula assistance.

⭐ 10.8k grist-core Apache-2.0
Multi-Framework Data Extraction & Transformation
2w ago 👁 2
Data Extraction & Transformation Security Reviewed

xq Command-Line XML and HTML Beautifier and Content Extractor

xq is a command-line XML and HTML beautifier and content extractor written in Go. It provides syntax highlighting, automatic formatting, XPath and CSS selector queries, and JSON output conversion for XML and HTML documents.

⭐ 1.1k xq MIT
Multi-Framework Data Extraction & Transformation
2w ago 👁 3
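For readers who want the same kind of path-based extraction inside a script, the sketch below uses Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. It does not use xq itself; the `extract` helper and the sample feed are hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def extract(xml_text, path):
    """Return the text of every element matching a simple XPath expression."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(path)]


feed = (
    "<rss><channel>"
    "<item><title>First</title></item>"
    "<item><title>Second</title></item>"
    "</channel></rss>"
)

titles = extract(feed, ".//item/title")
print(json.dumps(titles))  # prints ["First", "Second"]
```

ElementTree handles well-formed XML only, so real-world HTML usually needs a more forgiving parser.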
Data Extraction & Transformation Security Reviewed

htmlq Command-Line HTML Content Extractor with CSS Selectors

htmlq is a command-line tool for extracting content from HTML using CSS selectors, functioning as the HTML equivalent of jq. Written in Rust, it lets you pipe HTML through CSS selectors to extract text, attributes, and structured content directly from the terminal.

⭐ 7.5k htmlq MIT ⚠ unmaintained
Multi-Framework Data Extraction & Transformation
2w ago 👁 3
Data Extraction & Transformation Security Reviewed

Newsboat Terminal RSS and Atom Feed Reader

Newsboat is an actively maintained RSS/Atom feed reader for the text console. A fork of the discontinued Newsbeuter, it provides a fast, keyboard-driven interface for subscribing to, reading, and managing feeds with powerful filtering, macro support, and scriptable automation.

⭐ 3.8k newsboat MIT
Custom Agents Data Extraction & Transformation
2w ago 👁 5
Data Extraction & Transformation Security Reviewed

franc Natural Language Detection Library and CLI

franc is a JavaScript library and CLI tool for detecting the language of text. It supports up to 419 languages and returns ISO 639-3 codes, making it one of the more comprehensive open-source language detection tools available for Node.js.

⭐ 4.4k franc MIT ⚠ unmaintained
Multi-Framework Data Extraction & Transformation
2w ago 👁 2
Data Extraction & Transformation Security Reviewed

Surya Document OCR with Layout Analysis and Table Recognition

Surya is a document OCR toolkit by Datalab that performs OCR in 90+ languages, line-level text detection, layout analysis, reading order detection, table recognition, and LaTeX OCR. It benchmarks favorably against cloud OCR services on a wide range of document types.

⭐ 19.5k surya GPL-3.0
Custom Agents Data Extraction & Transformation
2w ago 👁 3
Data Extraction & Transformation Security Reviewed

PaddleOCR Multilingual Document OCR and Structured Data Toolkit

PaddleOCR is a powerful, lightweight OCR toolkit developed by Baidu that converts documents and images into structured, AI-friendly data like JSON and Markdown. It supports 100+ languages with industry-leading accuracy, bridging the gap between images/PDFs and LLMs.

⭐ 73.7k paddleocr Apache-2.0
Multi-Framework Data Extraction & Transformation
2w ago 👁 2
Data Extraction & Transformation Security Reviewed

ExifTool Metadata Reader and Writer for Images and Files

ExifTool by Phil Harvey is a comprehensive Perl-based CLI tool for reading, writing, and editing metadata in over 400 file types. It extracts EXIF, IPTC, XMP, GPS, and maker note data from images, videos, audio, PDFs, and documents, making it the industry standard for metadata forensics and batch processing.

⭐ 4.6k exiftool
Multi-Framework Data Extraction & Transformation
2w ago 👁 4
Data Extraction & Transformation Security Reviewed

gallery-dl Image Gallery and Collection Downloader

gallery-dl is a command-line tool for downloading image galleries and collections from dozens of hosting sites including Pixiv, DeviantArt, Twitter, Reddit, Instagram, and Danbooru. It supports authentication, metadata extraction, filtering, and configurable output templates.

⭐ 17.5k gallery-dl
Multi-Framework Data Extraction & Transformation
2w ago 👁 3
Data Extraction & Transformation Security Reviewed

LangExtract LLM-Powered Structured Text Extraction

LangExtract by Google is a Python library for extracting structured information from unstructured text using LLMs with precise source grounding. With 35,000+ GitHub stars, it handles everything from clinical notes to literary analysis, producing verified extraction results with exact source text mappings and interactive visualizations.

⭐ 35k langextract
Custom Agents Data Extraction & Transformation
2w ago 👁 5
Data Extraction & Transformation Security Reviewed

Maxun No-Code Web Data Extraction Platform

Maxun is an open-source no-code web data platform for turning any website into structured, reliable data. It supports extraction via recorder mode and LLM-powered natural language mode, plus crawling, scraping, and search capabilities. With 15,000+ GitHub stars and both SDK and CLI interfaces, it handles everything from simple page scrapes to complex automated workflows.

⭐ 15.3k maxun
Custom Agents Data Extraction & Transformation
2w ago 👁 4
Data Extraction & Transformation Security Reviewed

Typesense Typo-Tolerant Search Engine

Typesense is an open-source, typo-tolerant search engine built in C++ for building fast, relevant search experiences. It serves as a self-hostable alternative to Algolia with support for vector search, geo-search, and faceted filtering.

⭐ 25.5k typesense
Custom Agents Data Extraction & Transformation
2w ago 👁 5
Data Extraction & Transformation Security Reviewed

trdsql SQL Query Engine for CSV, JSON, and YAML Files

trdsql is a CLI tool that executes SQL queries directly on CSV, LTSV, JSON, YAML, and TBLN files. It supports PostgreSQL and MySQL syntax, can join data across multiple files and databases, and outputs results in various formats including JSON, Markdown, and vertical display.

⭐ 2.2k trdsql
Custom Agents Data Extraction & Transformation
2w ago 👁 2
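The idea of running SQL directly over a CSV file can be sketched with Python's standard `csv` and `sqlite3` modules. This is a swapped-in technique for illustration, not trdsql's implementation or command-line interface: `query_csv`, the default table name `data`, and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one query on it."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    # Columns are created without types, so every value is stored as text.
    columns = ", ".join(f'"{name}"' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


orders = "city,qty\nOslo,3\nLima,5\nOslo,4\n"
totals = query_csv(
    orders,
    "SELECT city, SUM(CAST(qty AS INTEGER)) FROM data GROUP BY city ORDER BY city",
)
print(totals)  # prints [('Lima', 5), ('Oslo', 7)]
```

Because the values are loaded as text, numeric work needs an explicit `CAST`, as in the query above.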
Data Extraction & Transformation Security Reviewed

jnv Interactive JSON Navigator and jq Filter Editor

jnv is a terminal-based interactive JSON viewer and jq filter editor written in Rust. It lets developers navigate complex JSON structures visually while building and testing jq queries in real time, with syntax highlighting, auto-completion, and clipboard support.

⭐ 6k jnv
Claude Code Data Extraction & Transformation
2w ago 👁 4
Data Extraction & Transformation Security Reviewed

csvkit Python CSV Utility Suite

csvkit is a suite of Python command-line utilities for converting to, working with, and analyzing CSV files. It includes tools for format conversion, querying CSV with SQL, data cleaning, filtering, sorting, and statistical analysis.

⭐ 6.4k csvkit
Custom Agents Data Extraction & Transformation
2w ago 👁 4
Data Extraction & Transformation Security Reviewed

Redpanda Connect Declarative Stream Processor

Redpanda Connect (formerly Benthos) is a high-performance stream processor that connects data sources and sinks through declarative YAML pipelines. It supports hundreds of connectors and a built-in mapping language called Bloblang for data transformation.

⭐ 8.6k connect
Custom Agents Data Extraction & Transformation
2w ago 👁 3
Data Extraction & Transformation Security Reviewed

Datasette Data Exploration and Publishing Tool

Datasette is an open-source Python tool for exploring and publishing data. It turns any SQLite database into an interactive web interface with a JSON API, enabling data journalists, researchers, and developers to share datasets without writing application code.

⭐ 10.9k datasette
Custom Agents Data Extraction & Transformation
2w ago 👁 4