Skill Detail

Extract schema.org, Open Graph, and JSON-LD metadata from web pages for indexing

Uses extruct to pull machine-readable metadata from raw HTML so an agent can classify, deduplicate, or enrich pages without brittle full-page parsing. It is best for metadata harvesting workflows, not for crawling an entire site or rendering JavaScript-heavy pages.

Research & Scraping Multi-Framework Security Reviewed

⭐ 961 GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill extract-schema-org-open-graph-and-json-ld-metadata-from-web-pages-for-indexing

Works best when you want a reusable capability, not another fragile one-off prompt.

View source

At a glance

Tools required

Python 3 environment

Install & setup

pip install extruct

Author

Scrapinghub

Publisher

Company

Last updated

Apr 11, 2026

Quick brief

This ASE entry is built around extruct, the metadata extraction library published from the scrapinghub/extruct repository. The agent behavior is specific: fetch or receive HTML, extract embedded structured metadata (JSON-LD, microdata, RDFa, Open Graph, and related formats, much of it expressed in the schema.org vocabulary), then hand those fields to an indexing, enrichment, or QA pipeline. This is useful when an agent needs structured page facts without depending on fragile CSS selectors or inferring metadata from visible text alone.
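A minimal sketch of that extract-and-hand-off flow, assuming the HTML has already been fetched. `extruct.extract` and its `base_url`, `syntaxes`, and `uniform` parameters come from the extruct library; the wrapper names `extract_metadata` and `count_items` and the chosen syntax subset are illustrative assumptions, not part of this listing.

```python
from typing import Any, Dict, List

# Syntax names accepted by extruct.extract; this subset is an illustrative choice.
SYNTAXES = ["json-ld", "microdata", "opengraph", "rdfa"]


def extract_metadata(html: str, base_url: str) -> Dict[str, List[Any]]:
    """Run extruct over already-fetched HTML (hypothetical wrapper)."""
    import extruct  # imported lazily so count_items works without extruct installed

    # uniform=True asks extruct to return Open Graph items in a JSON-LD-like shape.
    return extruct.extract(html, base_url=base_url, syntaxes=SYNTAXES, uniform=True)


def count_items(data: Dict[str, List[Any]]) -> Dict[str, int]:
    """Summarize how many items each syntax produced, e.g. for logging or QA."""
    return {syntax: len(items) for syntax, items in data.items()}
```

The result of `extract_metadata` is a dict keyed by syntax name, so a downstream pipeline can pick only the syntaxes it trusts.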

How it works

What this skill actually does

Use this when the job is metadata harvesting from known pages. An agent might apply it while building a content index, validating publisher markup, enriching a dataset, deduplicating URLs, or preparing inputs for search and recommendation systems. It is the right tool when you already have page HTML or a fetcher in place and the next step is to turn embedded metadata into normalized records. It is not the right tool for full-browser automation or broad site crawling by itself.
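One way the "normalized records" step might look, as a sketch only. It assumes input shaped like extruct's output with `uniform=True` (a dict of lists keyed by `"json-ld"` and `"opengraph"`); the record fields (`url`, `title`, `type`) and the helper names are hypothetical choices for illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def first_value(items: List[Any], keys: Tuple[str, ...]) -> Optional[str]:
    """Return the first non-empty string found under any of the given keys."""
    for item in items:
        if not isinstance(item, dict):
            continue  # skip anything that is not a plain object
        for key in keys:
            value = item.get(key)
            if isinstance(value, str) and value.strip():
                return value.strip()
    return None


def to_record(page_url: str, data: Dict[str, List[Any]]) -> Dict[str, Optional[str]]:
    """Flatten extracted metadata into one record (field names are illustrative)."""
    json_ld = data.get("json-ld", [])
    open_graph = data.get("opengraph", [])
    return {
        # Prefer a URL declared in the markup; fall back to the fetched URL.
        "url": first_value(json_ld, ("url",))
        or first_value(open_graph, ("og:url",))
        or page_url,
        "title": first_value(json_ld, ("headline", "name"))
        or first_value(open_graph, ("og:title",)),
        "type": first_value(json_ld, ("@type",))
        or first_value(open_graph, ("og:type", "@type")),
    }


def dedupe(records: List[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Keep the first record seen for each declared URL."""
    seen = set()
    unique = []
    for record in records:
        if record["url"] not in seen:
            seen.add(record["url"])
            unique.append(record)
    return unique
```

Keying deduplication on the URL declared in the page's metadata, rather than the fetched URL, lets tracking-parameter variants of the same page collapse into one record.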

The narrow scope is what makes this a skill rather than a product. It is not a generic scraping platform, a headless browser, or a full crawler. The job to be done is extracting embedded metadata from pages that have already been fetched. Integration points include Python fetchers, Scrapy-style pipelines, research bots, indexing jobs, link-preview validation, and content quality checks that need structured metadata fields as inputs for downstream decisions.
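As an illustration of the link-preview validation integration point, the sketch below checks extracted Open Graph data for the four properties the Open Graph protocol lists as required. The function name `missing_open_graph` is hypothetical, and the input is assumed to be shaped like extruct's output with `uniform=True`, where the page's `og:type` may appear under `"@type"`.

```python
from typing import Any, Dict, List

# Properties the Open Graph protocol lists as required for every page.
REQUIRED_OG = ("og:title", "og:type", "og:image", "og:url")


def missing_open_graph(data: Dict[str, List[Any]]) -> List[str]:
    """Return the required Open Graph properties absent from extracted data."""
    present = set()
    for item in data.get("opengraph", []):
        if not isinstance(item, dict):
            continue
        # Record every property that has a non-empty value.
        present.update(key for key, value in item.items() if value)
        # Assumption: uniform output may carry og:type as "@type".
        if item.get("@type"):
            present.add("og:type")
    return [prop for prop in REQUIRED_OG if prop not in present]
```

An empty list means the page carries enough Open Graph data for a basic preview; a non-empty list can be passed to a content quality report.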

Best fit

When to reach for it

Best when the job fits Research & Scraping.
Works naturally with Multi-Framework setups.
Requires a Python 3 environment.
Installation is straightforward: pip install extruct

Trust & provenance

Why this listing is credible

Trust status: Security Reviewed.
961 GitHub stars on the linked upstream source.
Last updated Apr 11, 2026.
