Apache Tika Document Parser

Extracts structured text, metadata, and embedded objects from PDFs, Office documents, and more than 1,000 other file formats using the Apache Tika REST API. Outputs clean Markdown or JSON and preserves XMP metadata.

Data Extraction & Transformation Gemini Security Reviewed
โญ 3.7k GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill apache-tika-document-parser
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Author
The Apache Software Foundation
Last updated
Mar 24, 2026
Quick brief

The Apache Tika Document Parser skill provides universal document extraction using the Apache Tika content analysis framework via its REST API. It handles over 1,000 file formats including PDF, DOCX, XLSX, PPTX, EML, MSG, HTML, and legacy formats like WPD and RTF.

How it works

What this skill actually does

The skill sends each document to a Tika server instance and retrieves the extracted text in one of three output formats: plain text, XHTML, or structured JSON. It preserves document metadata, including XMP, Dublin Core, and format-specific properties. For scanned PDFs, it uses Tika’s Tesseract OCR integration to extract text from page images.
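The request flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: it assumes a Tika server reachable at `http://localhost:9998` (Tika's usual default port), uses the server's `PUT /tika` endpoint, and selects the output format through the `Accept` header. The names `TIKA_URL`, `ACCEPT`, `build_extract_request`, and `extract` are hypothetical, and JSON output on `/tika` may depend on the server version.

```python
import urllib.request

# Assumed address of a local Tika server; adjust for your deployment.
TIKA_URL = "http://localhost:9998"

# Output format -> Accept header sent to the /tika endpoint.
ACCEPT = {
    "text": "text/plain",
    "xhtml": "text/html",
    "json": "application/json",  # assumption: supported by newer Tika servers
}


def build_extract_request(data: bytes, fmt: str = "text",
                          base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking Tika for extracted content."""
    if fmt not in ACCEPT:
        raise ValueError(f"unsupported format: {fmt!r}")
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": ACCEPT[fmt]},
    )


def extract(data: bytes, fmt: str = "text", base_url: str = TIKA_URL) -> str:
    """Send the document to a running Tika server and return the response body."""
    request = build_extract_request(data, fmt, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, `extract(open("report.pdf", "rb").read(), "xhtml")` would return the XHTML rendering of the file. Keeping request construction separate from sending makes the format selection easy to check without a live server.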

Advanced capabilities include recursive extraction of embedded objects (images, attachments in emails, OLE objects in Office documents), language detection using Tika’s language identifier, and content-type detection independent of file extensions. The skill supports batch processing with configurable parallelism and outputs clean Markdown suitable for LLM consumption or vector embedding pipelines.
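As a rough illustration of recursive extraction, Tika's `/rmeta` endpoint returns a JSON list with one metadata object for the container document followed by one per embedded object. The sketch below flattens such a list into simple records. The function name `summarize_rmeta` and the record layout are hypothetical, and the sketch assumes the response has already been fetched and decoded with `json.loads`.

```python
from typing import Any

# Metadata keys emitted by Tika's recursive (/rmeta) JSON output.
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"


def summarize_rmeta(entries: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Flatten an /rmeta response into one record per container or embedded object."""
    records = []
    for index, entry in enumerate(entries):
        content_type = entry.get("Content-Type", "application/octet-stream")
        if isinstance(content_type, list):  # multi-valued metadata arrives as a list
            content_type = content_type[0]
        records.append({
            # The first entry is the container itself, so it gets the root path.
            "path": entry.get(PATH_KEY, "/") if index else "/",
            "content_type": content_type,
            # Entries without extracted text (e.g. images without OCR) yield "".
            "text": (entry.get(CONTENT_KEY) or "").strip(),
        })
    return records
```

Each record could then be rendered as its own Markdown section, with the embedded path as the heading, before being passed to an LLM or an embedding pipeline.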