Turn messy document collections into structured rows with DocETL
Define repeatable extraction pipelines that pull fields from large document collections, normalize outputs, and audit failures across the corpus.
Data Extraction & Transformation
Multi-Framework
Security Reviewed
⭐ 3.7k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill turn-messy-document-collections-into-structured-rows-with-docetl
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
Python 3.10+, DocETL, document corpus, extraction configuration
Install & setup
Install DocETL from the project instructions, configure the extraction pipeline for your document set, then run the pipeline to emit normalized structured outputs and review failures.
Author
UCB EPIC
Publisher
Organization
Last updated
Apr 15, 2026
Quick brief
Use DocETL when an agent needs to convert a pile of semi-structured documents into rows that downstream systems can trust. The agent can define extraction steps, normalize fields, track failures, and iterate on a repeatable document-to-structured-data pipeline instead of doing one-off parsing. The skill is scoped tightly to document extraction and auditability; it is not a generic document platform or a listing of LLM frameworks.
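The extract, normalize, and audit loop described above can be sketched in plain Python. This is only an illustration of the pattern, not DocETL's API: the function names (extract_fields, normalize, run_pipeline), the invoice fields, and the sample documents are all hypothetical, and a regex stands in for the LLM-backed extraction step that a real pipeline would configure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RunResult:
    rows: list[dict] = field(default_factory=list)      # normalized outputs
    failures: list[dict] = field(default_factory=list)  # audit trail of bad docs


def extract_fields(text: str) -> dict:
    """Hypothetical extraction step; a regex stands in for an LLM call."""
    invoice = re.search(r"Invoice\s*#?\s*(\w+)", text, re.IGNORECASE)
    total = re.search(r"Total:\s*\$?([\d,]+\.?\d*)", text, re.IGNORECASE)
    if not invoice or not total:
        raise ValueError("missing invoice id or total")
    return {"invoice_id": invoice.group(1), "total": total.group(1)}


def normalize(row: dict) -> dict:
    """Coerce raw strings into consistent types and casing."""
    return {
        "invoice_id": row["invoice_id"].upper(),
        "total": float(row["total"].replace(",", "")),
    }


def run_pipeline(docs: dict[str, str]) -> RunResult:
    """Run every document through the steps, recording failures instead of stopping."""
    result = RunResult()
    for doc_id, text in docs.items():
        try:
            row = normalize(extract_fields(text))
            result.rows.append({"doc_id": doc_id, **row})
        except ValueError as exc:
            result.failures.append({"doc_id": doc_id, "error": str(exc)})
    return result


# Made-up sample corpus: two parseable documents and one that should fail.
docs = {
    "a.txt": "Invoice #ab12\nTotal: $1,250.50",
    "b.txt": "invoice cd34 ... total: 99",
    "c.txt": "Meeting notes, no billing data here.",
}
result = run_pipeline(docs)
for row in result.rows:
    print(row)
for failure in result.failures:
    print("FAILED", failure)
```

Keeping failures in their own list, keyed by document, is what makes the run auditable: a rerun after adjusting the extraction step can be compared against the same failure set.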