Skill Detail

Parquet Schema Extractor for S3

Extracts and validates Parquet file schemas from Amazon S3 using PyArrow and boto3 (the AWS SDK for Python). Compares schemas across multiple partitions to detect schema drift and incompatible type changes. Outputs a schema diff report with partition paths and affected column details.

Data Extraction & Transformation Gemini Security Reviewed

Tool match: parquet | ⭐ 387 GitHub stars | ⬇ 170.7k/wk npm downloads | MIT license | ⚠ Repository looks unmaintained

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill parquet-schema-extractor-for-s3

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Last updated

Mar 19, 2026

Quick brief

This skill uses boto3 to list objects under an S3 prefix and downloads only the Parquet file footers using range requests (GetObject with a Range header), avoiding full file downloads. PyArrow parses the Parquet metadata footer and extracts the schema, including field names, data types, and nullability. The skill compares schemas across partition directories by hashing each schema into a fingerprint and flagging partitions whose fingerprint deviates. Type compatibility checks follow Arrow type promotion rules to identify breaking changes (e.g., a column changing from int32 to string). Schema history is optionally stored in an AWS DynamoDB table for drift tracking across daily pipeline runs. Output is a Markdown schema diff report listing the exact partition paths that contain incompatible schemas, along with recommended schema evolution strategies for Apache Iceberg or Delta Lake tables.
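
As a rough illustration of two steps in the brief (locating the footer for a range request, and fingerprinting schemas to spot drift), here is a minimal standard-library sketch. It relies on the Parquet file layout, where an unencrypted file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function names (`footer_byte_range`, `schema_fingerprint`, `group_by_fingerprint`), the `(name, type, nullable)` tuple representation, and the SHA-256 encoding are assumptions made for this example, not the skill's actual implementation.

```python
import hashlib
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian footer length + 4-byte magic


def footer_byte_range(tail: bytes, object_size: int) -> tuple[int, int]:
    """Return the inclusive (start, end) byte range of the metadata footer.

    `tail` is the last 8 bytes of the object; `object_size` is its total size.
    """
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not an unencrypted Parquet file: missing PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = object_size - TAIL_SIZE - footer_len
    if start < len(PARQUET_MAGIC):  # footer cannot overlap the leading magic
        raise ValueError("footer length exceeds object size")
    return start, object_size - TAIL_SIZE - 1


def schema_fingerprint(fields: list[tuple[str, str, bool]]) -> str:
    """Hash (name, type, nullable) triples, preserving column order."""
    canonical = "\n".join(
        f"{name}\t{dtype}\t{int(nullable)}" for name, dtype, nullable in fields
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_by_fingerprint(
    partitions: dict[str, list[tuple[str, str, bool]]],
) -> dict[str, list[str]]:
    """Group partition paths by fingerprint; more than one group means drift."""
    groups: dict[str, list[str]] = {}
    for path, fields in partitions.items():
        groups.setdefault(schema_fingerprint(fields), []).append(path)
    return groups
```

With boto3, the 8-byte tail could plausibly be fetched with `get_object(..., Range="bytes=-8")` and the footer itself with `Range=f"bytes={start}-{end}"`. The field triples would come from the schema PyArrow parses out of that footer. Because the hash preserves column order, a reordered schema counts as a deviation here; a variant that sorts fields first would ignore ordering.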

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with Gemini setups.

Trust & provenance

Why this listing is credible

Built around the parquet toolchain.
Trust status: Security Reviewed.
387 GitHub stars on the linked upstream source.
170.7k/week npm downloads recorded.
License: MIT.
Last updated Mar 19, 2026.