pdfplumber Python PDF Text and Table Extraction Library

pdfplumber is a Python library for extracting detailed information from PDFs — text, tables, lines, rectangles, and curves — with visual debugging support. Built on pdfminer.six, it excels at structured table extraction from machine-generated PDFs and includes both a Python API and CLI.

Data Extraction & Transformation Custom Agents Security Reviewed

⭐ 10.1k GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill pdfplumber-python-pdf-text-table-extraction

Works best when you want a reusable capability, not another fragile one-off prompt.

At a glance

Last updated: Jun 3, 2026

Quick brief

pdfplumber is a Python library for plumbing PDFs for detailed information about every text character, rectangle, line, curve, and image on each page. Its primary strength is structured table extraction from machine-generated PDFs, with a visual debugging system that lets you see exactly what the parser detects. The library is maintained by Jeremy Singer-Vine, has about 10.1k GitHub stars, and has millions of PyPI downloads.

How it works

What this skill actually does

Core Capabilities

pdfplumber provides access to every visual element on a PDF page. The page.chars property returns a list of dictionaries describing each character’s text, font, size, and bounding box. Similarly, page.lines, page.rects, page.curves, and page.images expose geometric and visual elements. This granular access makes pdfplumber useful for PDFs where layout matters — financial statements, government forms, invoices, and research papers.
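As a minimal sketch of working with character-level data, the snippet below counts how many characters use each font and size. The helper names (font_usage, first_page_font_usage) are hypothetical and not part of pdfplumber; the sketch assumes each entry in page.chars carries "fontname" and "size" keys, as described above.

```python
from collections import Counter


def font_usage(chars):
    """Count characters per (fontname, size) pair.

    `chars` is a list of dicts shaped like the entries of page.chars.
    Sizes are rounded to one decimal so near-identical sizes group together.
    """
    counts = Counter()
    for ch in chars:
        counts[(ch["fontname"], round(ch["size"], 1))] += 1
    return counts


def first_page_font_usage(path):
    """Open a PDF and summarize font usage on its first page."""
    # Imported here so font_usage() can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return font_usage(pdf.pages[0].chars)
```

A summary like this is one way to spot headings or footnotes before extraction, since they usually differ from body text in font or size.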

Table Extraction

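The page.extract_tables() and page.extract_table() methods identify table structures by finding the lines and edges on a page and the cells formed where they intersect. extract_tables() returns every table found on the page, while extract_table() returns only the largest. Extraction is configurable through a table_settings dictionary that controls the vertical and horizontal strategies (lines, lines_strict, text, or explicit), the minimum number of aligned words needed to infer a line under the text strategy (min_words_vertical and min_words_horizontal), snap and join tolerances, and the minimum edge length. This makes it possible to tune extraction for specific PDF layouts.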
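The sketch below shows one way to pass a table_settings dictionary and tidy the result. The specific setting values, the clean_table helper, and the extract_first_table wrapper are illustrative assumptions, not recommended defaults; real documents usually need their own tuning.

```python
# Illustrative settings: ruled vertical lines, rows inferred from text alignment.
TABLE_SETTINGS = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "snap_tolerance": 3,
    "min_words_horizontal": 1,
}


def clean_table(rows):
    """Turn None cells into "", strip whitespace, and drop fully empty rows."""
    cleaned = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_first_table(path, page_number=0, table_settings=None):
    """Return the largest table on one page as a list of cleaned rows."""
    # Imported here so clean_table() can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        rows = page.extract_table(table_settings=table_settings or TABLE_SETTINGS)
    # extract_table() returns None when no table is detected.
    return clean_table(rows) if rows else []
```

Cleaning is kept separate from extraction so the same post-processing can be reused while experimenting with different settings.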

Text Extraction

The page.extract_text() method reconstructs readable text from character-level data, with a layout=True option that preserves spatial positioning. Additional parameters control x and y tolerances for grouping characters into words and lines, making it adaptable to PDFs with unusual spacing or formatting.
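A short sketch of layout-preserving extraction follows. Layout mode pads lines with spaces to mimic positions on the page, so the example trims that padding afterwards. The helper names (tidy_layout_text, page_text) are hypothetical, and the tolerance values shown are only starting points.

```python
def tidy_layout_text(text):
    """Strip trailing spaces from each line and drop leading/trailing blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and not lines[0]:
        lines.pop(0)
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)


def page_text(path, page_number=0, layout=True, x_tolerance=3, y_tolerance=3):
    """Extract text from one page, optionally preserving its spatial layout."""
    # Imported here so tidy_layout_text() can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        text = pdf.pages[page_number].extract_text(
            layout=layout,
            x_tolerance=x_tolerance,
            y_tolerance=y_tolerance,
        )
    return tidy_layout_text(text or "")
```

Raising x_tolerance tends to merge characters that are spaced far apart into one word, while lowering it splits them; the right value depends on the document.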

Visual Debugging

pdfplumber includes a visual debugging system via page.to_image(), which renders the page and returns a PageImage object (a wrapper around a PIL image) that you can draw overlays on. You can draw bounding boxes around detected characters, tables, lines, and rectangles to verify what the parser finds, which helps when diagnosing extraction issues.
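As a sketch of the debugging workflow, the snippet below renders a page, overlays what the table finder detects, and saves the result as a PNG next to the PDF. The output naming scheme and both helper names (debug_image_path, save_table_debug_image) are assumptions made for illustration.

```python
from pathlib import Path


def debug_image_path(pdf_path, page_number):
    """Build an output path such as report-page1-debug.png beside the PDF."""
    p = Path(pdf_path)
    return p.with_name(f"{p.stem}-page{page_number + 1}-debug.png")


def save_table_debug_image(pdf_path, page_number=0, resolution=150):
    """Render one page with table-finder overlays and save it as a PNG."""
    # Imported here so debug_image_path() can be tried without pdfplumber installed.
    import pdfplumber

    out_path = debug_image_path(pdf_path, page_number)
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_number]
        im = page.to_image(resolution=resolution)
        # Overlay the lines, intersections, and cells the table finder detects.
        im.debug_tablefinder()
        im.save(str(out_path))
    return out_path
```

Opening the saved image next to the original page usually makes it clear whether missing table cells come from undetected lines or from tolerance settings.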

Cropping and Filtering

Pages can be cropped with page.crop(bounding_box) to focus extraction on specific regions, and objects can be filtered with page.filter() to remove unwanted elements before extraction. These operations return new Page-like objects that support all the same extraction methods.
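The following sketch combines both operations: it crops a page to its top half, filters out small characters, and extracts the remaining text. The helper names (top_half, keep_body_text, top_half_text) and the 8-point size threshold are hypothetical; the sketch assumes bounding boxes are (x0, top, x1, bottom) tuples and that each object dict carries an "object_type" key.

```python
def top_half(bbox):
    """Return the top half of an (x0, top, x1, bottom) bounding box."""
    x0, top, x1, bottom = bbox
    return (x0, top, x1, top + (bottom - top) / 2)


def keep_body_text(obj, min_size=8.0):
    """Keep non-character objects, and characters at or above min_size."""
    if obj.get("object_type") != "char":
        return True
    return obj.get("size", 0) >= min_size


def top_half_text(path, page_number=0):
    """Extract text from the top half of a page, ignoring small print."""
    # Imported here so the helpers above can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_number]
        region = page.crop(top_half(page.bbox))
        return region.filter(keep_body_text).extract_text()
```

Deriving the crop box from page.bbox rather than hard-coding coordinates keeps the crop inside the page even when the page origin is not at (0, 0).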

Form Value Extraction

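For interactive PDF forms (AcroForms), filled-in field values can be read programmatically by walking the document's AcroForm dictionary through the underlying pdfminer.six objects that pdfplumber exposes. Page annotations are available through page.annots.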

CLI Interface

pdfplumber includes a command-line interface that outputs CSV, JSON, or plain text. Run pdfplumber input.pdf to get a CSV of all objects, or use --format text for layout-preserved plain text output. The --pages flag supports page ranges.

Installation and Compatibility

Install with pip install pdfplumber. The library supports Python 3.10 through 3.14 and is built on pdfminer.six. It is MIT-licensed and available at github.com/jsvine/pdfplumber and pypi.org/project/pdfplumber.

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Custom Agents setups.

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
10.1k GitHub stars on the linked upstream source.
Last updated Jun 3, 2026.
