Apache Spark DataFrame ETL Pipeline

Automates PySpark DataFrame transformations, including schema inference, partition pruning, and Delta Lake merge operations. Integrates with the AWS Glue Data Catalog and the Apache Iceberg table format for lakehouse architectures.

Data Extraction & Transformation | OpenClaw | Security Reviewed

Tool match: spark | ⭐ 43.1k GitHub stars | Apache-2.0 license

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill spark-dataframe-etl-pipeline

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Last updated: Mar 24, 2026

Quick brief

The Apache Spark DataFrame ETL Pipeline skill automates complex data engineering workflows using PySpark and the Spark SQL API. It handles schema inference from heterogeneous data sources including Parquet, ORC, Avro, and JSON formats, applying automatic type coercion and null handling strategies.
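As a rough illustration of what schema inference, type coercion, and null handling can look like in PySpark, here is a minimal sketch. It is not the skill's own code: the helper names `reader_options` and `load_and_clean`, and the choice of JSON reader options, are assumptions made for this example. Parquet, ORC, and Avro files carry their own schema, so only JSON needs inference options; reading Avro also assumes the external spark-avro package is available.

```python
from typing import Dict

# Formats from the listing whose files store their own schema.
SELF_DESCRIBING = {"parquet", "orc", "avro"}


def reader_options(fmt: str) -> Dict[str, str]:
    """Return DataFrameReader options for a format (hypothetical helper)."""
    fmt = fmt.lower()
    if fmt in SELF_DESCRIBING:
        return {}  # nothing to infer, the schema is read from the files
    if fmt == "json":
        # samplingRatio=1.0 infers the schema from every record;
        # PERMISSIVE mode sets malformed fields to null instead of failing.
        return {"samplingRatio": "1.0", "mode": "PERMISSIVE"}
    raise ValueError(f"unsupported format: {fmt}")


def load_and_clean(spark, path: str, fmt: str,
                   casts: Dict[str, str], defaults: Dict[str, object]):
    """Load a dataset, cast selected columns, and fill nulls.

    `spark` is an existing SparkSession. `casts` maps column name to a Spark
    SQL type string (e.g. "bigint"); `defaults` maps column name to the value
    that replaces nulls. Both mappings are assumptions of this sketch.
    """
    df = spark.read.format(fmt).options(**reader_options(fmt)).load(path)
    for column, spark_type in casts.items():
        df = df.withColumn(column, df[column].cast(spark_type))
    return df.fillna(defaults) if defaults else df
```

A call such as `load_and_clean(spark, "s3://bucket/events/", "json", {"user_id": "bigint"}, {"country": "unknown"})` would then return a DataFrame with explicit types and no nulls in `country`; the path and column names here are placeholders.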

How it works

What this skill actually does

Key capabilities include partition pruning optimization for large-scale datasets, predicate pushdown to minimize I/O, and adaptive query execution tuning. The skill integrates natively with Delta Lake for ACID-compliant merge operations (MERGE INTO), enabling upsert patterns across billion-row tables.
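The upsert pattern described above can be sketched with the Delta Lake Python API (`DeltaTable.merge`). This is an illustrative outline under stated assumptions, not the skill's implementation: `merge_condition` and `upsert` are hypothetical helpers, the aliases `t` and `s` are arbitrary, and adding a partition predicate to the merge condition assumes the target table is partitioned by the column used in that predicate. It requires the `delta-spark` package at run time.

```python
from typing import Optional, Sequence


def merge_condition(keys: Sequence[str],
                    partition_filter: Optional[str] = None) -> str:
    """Build the ON clause for a merge between target `t` and source `s`."""
    if not keys:
        raise ValueError("at least one key column is required")
    clauses = [f"t.{k} = s.{k}" for k in keys]
    if partition_filter:
        # A predicate on the target's partition column lets Delta skip
        # partitions that cannot contain matching rows.
        clauses.append(f"({partition_filter})")
    return " AND ".join(clauses)


def upsert(spark, updates_df, target_table: str, keys: Sequence[str],
           partition_filter: Optional[str] = None) -> None:
    """Update matching rows and insert new ones in a single ACID commit."""
    from delta.tables import DeltaTable  # imported lazily; needs delta-spark

    (
        DeltaTable.forName(spark, target_table)
        .alias("t")
        .merge(updates_df.alias("s"), merge_condition(keys, partition_filter))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```

For example, `upsert(spark, updates_df, "sales.orders", ["order_id"], "t.order_date >= '2026-03-01'")` would restrict the merge to recent partitions; the table and column names are placeholders. Adaptive query execution is a session-level setting (`spark.sql.adaptive.enabled`) and is independent of the merge call itself.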

For cloud-native deployments, it connects to the AWS Glue Data Catalog for centralized metadata management, supports the Apache Iceberg table format for schema evolution, and tunes Spark session parameters such as memory and shuffle partition settings. Data quality checks via Great Expectations are built in, validating row counts, null percentages, and statistical distributions before committing writes.
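To make the session setup concrete, the sketch below assembles the Spark properties typically used to point an Iceberg catalog at AWS Glue and to set shuffle and memory parameters. The property keys come from the Spark and Iceberg documentation; the helper names, the catalog name `glue`, the warehouse path, and the default values are assumptions chosen for illustration, not settings taken from the skill. The Iceberg Spark runtime and AWS bundle JARs must be on the classpath for the session to start.

```python
from typing import Dict


def lakehouse_conf(warehouse: str, catalog: str = "glue",
                   shuffle_partitions: int = 200,
                   executor_memory: str = "4g") -> Dict[str, str]:
    """Return Spark properties for an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Iceberg SQL extensions (MERGE INTO, ALTER TABLE ... for evolution)
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
        # Adaptive query execution can coalesce small shuffle partitions.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        # Only takes effect if set before the application starts.
        "spark.executor.memory": executor_memory,
    }


def build_session(app_name: str, conf: Dict[str, str]):
    """Create (or reuse) a SparkSession with the given properties."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With this in place, `build_session("etl", lakehouse_conf("s3://my-bucket/warehouse"))` would expose tables as `glue.<database>.<table>`; the bucket name is a placeholder. Suitable values for shuffle partitions and executor memory depend on data volume and cluster size, so the defaults above should be treated as starting points only.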

Best fit

When to reach for it

Best when the job fits Data Extraction & Transformation.
Works naturally with OpenClaw setups.

Trust & provenance

Why this listing is credible

Built around the spark toolchain.
Trust status: Security Reviewed.
43.1k GitHub stars on the linked upstream source.
License: Apache-2.0.
Last updated Mar 24, 2026.
