Skill Detail

Apache Tika Document Parser

Extracts structured text, metadata, and embedded objects from PDFs, Office documents, and more than 1,000 other file formats through the Apache Tika REST API. Outputs clean Markdown or JSON and preserves XMP metadata.

Data Extraction & Transformation Gemini Security Reviewed

⭐ 3.7k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill apache-tika-document-parser

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Author

The Apache Software Foundation

Last updated

Mar 24, 2026

Quick brief

The Apache Tika Document Parser skill provides universal document extraction using the Apache Tika content analysis framework via its REST API. It handles over 1,000 file formats including PDF, DOCX, XLSX, PPTX, EML, MSG, HTML, and legacy formats like WPD and RTF.

How it works

What this skill actually does

The skill sends documents to a Tika server instance and retrieves the extracted content as plain text, XHTML, or structured JSON. It preserves document metadata, including XMP, Dublin Core, and format-specific properties. For scanned PDFs, it uses Tika’s Tesseract OCR integration to extract text.
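The request flow described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: it assumes a Tika server is already running at http://localhost:9998 (the usual default port), and the names TIKA_URL, build_request, extract_text, and extract_metadata are hypothetical. The X-Tika-PDFOcrStrategy header is a Tika server request header; whether OCR actually runs depends on Tesseract being installed on the server.

```python
import json
import urllib.request

# Assumption: a tika-server instance is listening on its default port.
TIKA_URL = "http://localhost:9998"


def build_request(endpoint, data, accept, extra_headers=None):
    """Build a PUT request carrying the raw document bytes to a Tika endpoint."""
    headers = {"Accept": accept}
    headers.update(extra_headers or {})
    return urllib.request.Request(
        f"{TIKA_URL}/{endpoint}", data=data, headers=headers, method="PUT"
    )


def extract_text(data, ocr=False):
    """Return plain text from /tika; optionally ask Tika to OCR PDF pages."""
    extra = {"X-Tika-PDFOcrStrategy": "ocr_only"} if ocr else None
    req = build_request("tika", data, "text/plain", extra)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def extract_metadata(data):
    """Return the metadata dictionary (XMP, Dublin Core, etc.) from /meta."""
    req = build_request("meta", data, "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Requesting "text/html" instead of "text/plain" from the /tika endpoint returns the XHTML form. A caller would read a file in binary mode and pass the bytes, for example extract_text(open("scan.pdf", "rb").read(), ocr=True), where scan.pdf is a placeholder path.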

Advanced capabilities include recursive extraction of embedded objects (images, attachments in emails, OLE objects in Office documents), language detection using Tika’s language identifier, and content-type detection independent of file extensions. The skill supports batch processing with configurable parallelism and outputs clean Markdown suitable for LLM consumption or vector embedding pipelines.
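As a rough sketch of how these capabilities might map onto the Tika server endpoints: /rmeta/text returns one JSON entry per container and embedded document, /detect/stream returns a MIME type, and /language/stream returns a language code. Everything else here is an assumption for illustration, including the server address, the function names, the Markdown layout, and the default worker count; the skill's real output format is not documented in this listing.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumption: a tika-server instance is listening on its default port.
TIKA_URL = "http://localhost:9998"


def tika_put(endpoint, data, accept):
    """PUT raw bytes to a Tika endpoint and return the response body."""
    req = urllib.request.Request(
        f"{TIKA_URL}/{endpoint}", data=data,
        headers={"Accept": accept}, method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def detect_type(data):
    """Detect the MIME type from content, ignoring the file extension."""
    return tika_put("detect/stream", data, "text/plain").decode().strip()


def detect_language(text):
    """Identify the language of already-extracted text."""
    body = text.encode("utf-8")
    return tika_put("language/stream", body, "text/plain").decode().strip()


def extract_recursive(data):
    """Return a list of entries: the container first, then embedded objects."""
    return json.loads(tika_put("rmeta/text", data, "application/json"))


def to_markdown(entries):
    """Render /rmeta entries as Markdown (hypothetical layout)."""
    parts = []
    for i, entry in enumerate(entries):
        name = (entry.get("X-TIKA:embedded_resource_path")
                or entry.get("resourceName")
                or ("document" if i == 0 else f"embedded-{i}"))
        heading = "#" if i == 0 else "##"
        text = (entry.get("X-TIKA:content") or "").strip()
        parts.append(f"{heading} {name}\n\n{text}")
    return "\n\n".join(parts) + "\n"


def parse_batch(paths, workers=4):
    """Parse several files in parallel; returns {path: markdown}."""
    def parse(path):
        with open(path, "rb") as fh:
            return to_markdown(extract_recursive(fh.read()))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(parse, paths)))
```

Threads suit this workload because each task spends most of its time waiting on the server; the workers value should stay within what the Tika instance can handle concurrently.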

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Gemini setups.

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
3.7k GitHub stars on the linked upstream source.
Last updated Mar 24, 2026.
