Data Extraction & Transformation
Security Reviewed
markdownify is a Python library that converts HTML content to clean Markdown text. It supports tag filtering, heading styles, custom converters, and code language detection, making it essential for content extraction and document transformation pipelines.
Multi-Framework Data Extraction & Transformation
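To illustrate the kind of transformation described for markdownify, here is a minimal sketch built only on Python's standard-library `html.parser`. It is not markdownify's implementation or API: the class `TinyMarkdownConverter`, the function `html_to_markdown`, and its `strip` parameter are hypothetical names chosen for this example, and only headings, bold, italics, links, and paragraphs are handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, heavily simplified converter (not markdownify itself)."""

    def __init__(self, strip=()):
        super().__init__()
        self.strip = set(strip)  # tags whose markup is dropped; their text is kept
        self.parts = []          # output fragments, joined at the end
        self.hrefs = []          # href values of currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in self.strip:
            return
        if tag in HEADINGS:
            # ATX-style heading: one '#' per heading level
            self.parts.append("#" * int(tag[1]) + " ")
        elif tag in ("b", "strong"):
            self.parts.append("**")
        elif tag in ("i", "em"):
            self.parts.append("*")
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.strip:
            return
        if tag in HEADINGS or tag == "p":
            self.parts.append("\n\n")  # block elements end with a blank line
        elif tag in ("b", "strong"):
            self.parts.append("**")
        elif tag in ("i", "em"):
            self.parts.append("*")
        elif tag == "a" and self.hrefs:
            self.parts.append(f"]({self.hrefs.pop()})")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_markdown(html, strip=()):
    """Convert a small HTML fragment to Markdown text."""
    converter = TinyMarkdownConverter(strip)
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()


HTML = '<h2>Title</h2><p>Some <b>bold</b> text and a <a href="https://example.com">link</a>.</p>'
print(html_to_markdown(HTML))
print(html_to_markdown(HTML, strip=("a",)))  # tag filtering: link markup removed
```

The `strip` argument mimics the idea of tag filtering: the link text survives while the link markup is dropped. A real pipeline would use the library itself, which covers far more of HTML than this sketch.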
Data Extraction & Transformation
Security Reviewed
xan is a high-performance command-line tool for processing CSV files, written in Rust with a novel SIMD CSV parser. It offers filtering, slicing, aggregation, sorting, joining, and visualization of CSV data, with its own expression language for complex transformations and support for adjacent data formats.
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Orama is a full-text, vector, and hybrid search engine that runs in the browser, on a server, or at the edge in under 2KB. It provides built-in RAG pipeline support, typo tolerance, faceted search, and language-agnostic stemming, all without external dependencies.
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Gitingest turns a Git repository into a prompt-friendly text bundle that agents and LLM workflows can inspect quickly. It can be used as a hosted URL pattern, a Python package, or a local server for extracting repository summaries, structure, and source content.
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Unstructured is an open-source library for ingesting and partitioning PDFs, HTML, Office documents, emails, and other unstructured inputs into structured elements and metadata. It is commonly used as a preprocessing layer for RAG, search, extraction, and downstream AI pipelines.
⭐ 14.4k unstructured Apache-2.0
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Teable is an open-source no-code database platform built on PostgreSQL that uses a spreadsheet-like interface for creating database applications. It supports real-time collaboration, scales to millions of rows, and provides a REST API for programmatic access.
⭐ 21.1k teable NOASSERTION
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
MarkItDown is a Python utility by Microsoft that converts PDF, Word, PowerPoint, Excel, images, audio, HTML, and other files into Markdown for LLM consumption. It preserves headings, lists, tables, and links while producing token-efficient output optimized for text analysis pipelines.
⭐ 93.2k markitdown MIT
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Grist is an open-source modern relational spreadsheet that combines the flexibility of a spreadsheet with the robustness of a database. It supports Python formulas, a REST API, self-hosting via Docker, and AI-powered formula assistance.
⭐ 10.8k grist-core Apache-2.0
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
xq is a command-line XML and HTML beautifier and content extractor written in Go. It provides syntax highlighting, automatic formatting, XPath and CSS selector queries, and JSON output conversion for XML and HTML documents.
⭐ 1.1k xq MIT
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
htmlq is a command-line tool for extracting content from HTML using CSS selectors, functioning as the HTML equivalent of jq. Written in Rust, it lets you pipe HTML through CSS selectors to extract text, attributes, and structured content directly from the terminal.
⭐ 7.5k htmlq MIT ⚠ unmaintained
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Newsboat is an actively maintained RSS/Atom feed reader for the text console. A fork of the discontinued Newsbeuter, it provides a fast, keyboard-driven interface for subscribing to, reading, and managing feeds with powerful filtering, macro support, and scriptable automation.
⭐ 3.8k newsboat MIT
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
franc is a JavaScript library and CLI tool for detecting the language of text. It supports up to 419 languages and returns ISO 639-3 codes, making it one of the most comprehensive open-source language detection tools available for Node.js.
⭐ 4.4k franc MIT ⚠ unmaintained
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Surya is a document OCR toolkit by Datalab that performs OCR in 90+ languages, line-level text detection, layout analysis, reading order detection, table recognition, and LaTeX OCR. It benchmarks favorably against cloud OCR services on a wide range of document types.
⭐ 19.5k surya GPL-3.0
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
PaddleOCR is a powerful, lightweight OCR toolkit developed by Baidu that converts documents and images into structured, AI-friendly data like JSON and Markdown. It supports 100+ languages with industry-leading accuracy, bridging the gap between images/PDFs and LLMs.
⭐ 73.7k paddleocr Apache-2.0
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
ExifTool by Phil Harvey is a comprehensive Perl-based CLI tool for reading, writing, and editing metadata in over 400 file types. It extracts EXIF, IPTC, XMP, GPS, and maker note data from images, videos, audio, PDFs, and documents, making it the industry standard for metadata forensics and batch processing.
⭐ 4.6k exiftool
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
gallery-dl is a command-line tool for downloading image galleries and collections from dozens of hosting sites including Pixiv, DeviantArt, Twitter, Reddit, Instagram, and Danbooru. It supports authentication, metadata extraction, filtering, and configurable output templates.
⭐ 17.5k gallery-dl
Multi-Framework Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
LangExtract by Google is a Python library for extracting structured information from unstructured text using LLMs with precise source grounding. It handles inputs ranging from clinical notes to literary analysis, producing verified extraction results with exact source text mappings and interactive visualizations. The project has 35,000+ GitHub stars.
⭐ 35k langextract
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Maxun is an open-source no-code web data platform for turning websites into structured, reliable data. It supports extraction via recorder mode and LLM-powered natural language mode, plus crawling, scraping, and search capabilities. It offers both SDK and CLI interfaces and handles everything from simple page scrapes to complex automated workflows. The project has 15,000+ GitHub stars.
⭐ 15.3k maxun
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Typesense is an open-source, typo-tolerant search engine built in C++ for building fast, relevant search experiences. It serves as a self-hostable alternative to Algolia with support for vector search, geo-search, and faceted filtering.
⭐ 25.5k typesense
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
trdsql is a CLI tool that executes SQL queries directly on CSV, LTSV, JSON, YAML, and TBLN files. It supports PostgreSQL and MySQL syntax, can join data across multiple files and databases, and outputs results in various formats including JSON, Markdown, and vertical display.
⭐ 2.2k trdsql
Custom Agents Data Extraction & Transformation
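The idea of running SQL directly against a CSV file can be sketched with Python's standard library alone. This is not how trdsql works internally (trdsql is a separate CLI, and its flags are not shown here). The helper `query_csv`, the table name `data`, and the sample rows are assumptions made for this example, which loads the CSV into an in-memory SQLite table and queries it.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="data"):
    """Load CSV text into an in-memory SQLite table and run one SQL query.

    Hypothetical helper for illustration; the first CSV row is used as column names.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', body)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up sample data for the sketch
SAMPLE = "city,sales\nOslo,10\nBergen,5\nOslo,7\n"

result = query_csv(
    SAMPLE,
    # CSV values load as text, so cast before summing
    "SELECT city, SUM(CAST(sales AS INTEGER)) FROM data GROUP BY city ORDER BY city",
)
print(result)  # [('Bergen', 5), ('Oslo', 17)]
```

Values read from CSV arrive as text, which is why the query casts `sales` before aggregating. Dedicated tools typically handle type inference, multiple input formats, and output formatting on top of this basic pattern.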
Data Extraction & Transformation
Security Reviewed
jnv is a terminal-based interactive JSON viewer and jq filter editor written in Rust. It lets developers navigate complex JSON structures visually while building and testing jq queries in real time, with syntax highlighting, auto-completion, and clipboard support.
⭐ 6k jnv
Claude Code Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
csvkit is a suite of Python command-line utilities for converting to, working with, and analyzing CSV files. It includes tools for format conversion, querying CSV with SQL, data cleaning, filtering, sorting, and statistical analysis.
⭐ 6.4k csvkit
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Redpanda Connect (formerly Benthos) is a high-performance stream processor that connects data sources and sinks through declarative YAML pipelines. It supports hundreds of connectors and a built-in mapping language called Bloblang for data transformation.
⭐ 8.6k connect
Custom Agents Data Extraction & Transformation
Data Extraction & Transformation
Security Reviewed
Datasette is an open-source Python tool for exploring and publishing data. It turns any SQLite database into an interactive web interface with a JSON API, enabling data journalists, researchers, and developers to share datasets without writing application code.
⭐ 10.9k datasette
Custom Agents Data Extraction & Transformation