OpenGraph.io

Content Extraction API

Extract specific HTML elements (titles, headers, paragraphs) in a structured, LLM-ready format. Feed clean web data directly into RAG pipelines, content analysis tools, or AI applications without running your own scraper infrastructure.

API Version

v3.0 enables smart defaults: auto_proxy, auto_render, and retry are all on by default. The best proxy and rendering strategy is chosen automatically for each target domain.

Endpoint

HTTP
POST https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID

Content-Type: application/json

Parameters

Body Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| site | string | Yes | | The URL to scrape |
| selectors | object | No | | CSS selector configurations for structured data extraction |
| html_elements | string | No | title,h1,h2,h3,h4,h5,p | Comma-separated HTML tags used to generate concatenatedText |

Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_ok | boolean | true | Allow cached results |
| max_cache_age | number | | Max cache age in seconds |
| full_render | boolean | false | Use headless browser rendering |
| use_proxy | boolean | false | Route request through proxy |
| use_premium | boolean | false | Use premium proxy (requires plan support) |
| use_superior | boolean | false | Use superior proxy (requires plan support) |
| auto_proxy | boolean | true | Automatically select the best proxy for the target domain |
| auto_render | boolean | true | Automatically use headless rendering when beneficial |
| retry | boolean | true | Retry with proxy escalation on failure (requires plan support) |
| max_retries | number | 3 | Max retry attempts (1–4) |
| retry_escalate | boolean | true | Escalate proxy level on each retry |
| ai_sanitize | boolean | false | Enable prompt injection detection |
| ai_sanitize_mode | string | sanitize | One of: sanitize, warn, block |
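As an illustration of how the query parameters combine with app_id, here is a minimal Python sketch that builds the request URL. The helper name build_extract_url is hypothetical, and sending booleans as lowercase "true"/"false" strings is an assumption, not something this page confirms.

```python
from urllib.parse import urlencode

BASE_URL = "https://opengraph.io/api/3.0/extract"


def build_extract_url(app_id, **options):
    """Return the endpoint URL with app_id and any query parameters set.

    Hypothetical helper; parameter names come from the table above.
    """
    max_retries = options.get("max_retries")
    if max_retries is not None and not 1 <= max_retries <= 4:
        # The table documents a 1-4 range for max_retries.
        raise ValueError("max_retries must be between 1 and 4")

    params = {"app_id": app_id}
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are serialized as lowercase strings.
            value = "true" if value else "false"
        params[name] = value
    return f"{BASE_URL}?{urlencode(params)}"


url = build_extract_url(
    "YOUR_APP_ID", full_render=True, auto_proxy=False, max_retries=2
)
print(url)
```

Parameters left out of the call are simply not sent, so the defaults listed in the table apply.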

Selectors Format

The selectors object maps custom keys to CSS selector configurations:

Selectors Example
{
  "selectors": {
    "pageTitle": {
      "selector": "h1.title",
      "type": "text"
    },
    "navLinks": {
      "selector": "a.nav-link",
      "multiple": true,
      "type": "attr",
      "attr": "href"
    },
    "firstParagraph": {
      "selector": "article p",
      "type": "text"
    }
  }
}

Selector Config Options

| Property | Type | Default | Description |
|---|---|---|---|
| selector | string | | Any valid CSS selector |
| multiple | boolean | false | true returns all matches as an array; false returns only the first match |
| type | string | text | text extracts inner text; attr extracts an HTML attribute value |
| attr | string | | The attribute to extract (required when type is attr) |
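The defaults and the attr requirement can be checked client-side before a request is sent. The following Python sketch is one way to do that; normalize_selector is a hypothetical helper, and treating selector as mandatory is an assumption, since the table does not mark it as required.

```python
def normalize_selector(config):
    """Fill in the documented defaults and check the attr requirement."""
    if not config.get("selector"):
        # Assumption: a config without a selector is not useful.
        raise ValueError("selector is required")

    normalized = {
        "selector": config["selector"],
        "multiple": config.get("multiple", False),  # documented default
        "type": config.get("type", "text"),         # documented default
    }
    if normalized["type"] == "attr":
        # Documented rule: attr is required when type is attr.
        if not config.get("attr"):
            raise ValueError("attr is required when type is 'attr'")
        normalized["attr"] = config["attr"]
    return normalized


selectors = {
    "pageTitle": normalize_selector({"selector": "h1.title"}),
    "navLinks": normalize_selector(
        {"selector": "a.nav-link", "multiple": True, "type": "attr", "attr": "href"}
    ),
}
print(selectors)
```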

Example Request

curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
  -H "Content-Type: application/json" \
  -d '{
    "site": "https://example.com",
    "selectors": {
      "heading": {
        "selector": "h1",
        "type": "text"
      },
      "allLinks": {
        "selector": "a",
        "multiple": true,
        "type": "attr",
        "attr": "href"
      }
    }
  }'
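For readers calling the endpoint from Python rather than curl, the same request could be sketched with the standard library as follows. The function names make_extract_request and send are hypothetical, and the 30-second timeout is an arbitrary choice; the URL, header, and body fields mirror the curl example above.

```python
import json
import urllib.request
from urllib.parse import urlencode


def make_extract_request(app_id, site, selectors=None, html_elements=None):
    """Build (but do not send) a POST request matching the curl example."""
    body = {"site": site}
    if selectors is not None:
        body["selectors"] = selectors
    if html_elements is not None:
        body["html_elements"] = html_elements
    url = "https://opengraph.io/api/3.0/extract?" + urlencode({"app_id": app_id})
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = make_extract_request(
    "YOUR_APP_ID",
    "https://example.com",
    selectors={
        "heading": {"selector": "h1", "type": "text"},
        "allLinks": {"selector": "a", "multiple": True, "type": "attr", "attr": "href"},
    },
)
# result = send(request)  # uncomment once YOUR_APP_ID is replaced with a real ID
```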

Example Response

Response
{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}

Response Fields

| Field | Presence | Description |
|---|---|---|
| url | Always | The URL that was requested |
| concatenatedText | Always | Plain text extracted from the specified (or default) html_elements tags, concatenated into a single string |
| data | Only when selectors provided | An object containing extraction results keyed by your selector names |
| ai_safety | Only when ai_sanitize enabled | Prompt injection risk assessment |

LLM Tip: Use concatenatedText when feeding content to AI models for summarization. It provides clean text without HTML markup.
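Because data and ai_safety are optional, client code should not assume they are present. The Python sketch below reads the example response shown above and builds a summarization prompt from concatenatedText; read_response, to_prompt, the prompt wording, and the 4000-character cap are all hypothetical choices for illustration.

```python
import json

EXAMPLE_RESPONSE = json.loads("""
{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}
""")


def read_response(payload):
    """Return the response fields, with safe fallbacks for the optional ones."""
    return {
        "url": payload["url"],                  # always present
        "text": payload["concatenatedText"],    # always present
        "data": payload.get("data", {}),        # only when selectors were sent
        "ai_safety": payload.get("ai_safety"),  # only when ai_sanitize is on
    }


def to_prompt(payload, max_chars=4000):
    """Build a summarization prompt from the markup-free text."""
    fields = read_response(payload)
    return f"Summarize the page at {fields['url']}:\n\n{fields['text'][:max_chars]}"


fields = read_response(EXAMPLE_RESPONSE)
print(fields["data"]["heading"])
print(to_prompt(EXAMPLE_RESPONSE))
```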

AI Safety

When ai_sanitize is enabled, the response includes an ai_safety object with a prompt injection risk assessment:

AI Safety Response
{
  "ai_safety": {
    "risk_score": 0.02,
    "risk_level": "low",
    "signals": {}
  }
}

Use ai_sanitize_mode to control behavior: sanitize strips detected injections, warn adds flags but keeps content, and block rejects high-risk responses with a 422 error.
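In warn mode the content is kept, so the caller decides what to do with the assessment. One possible gate is sketched below in Python. The helper names and the 0.5 threshold are hypothetical, and the sketch assumes risk_score is a number where higher means riskier; this page shows only a single example value (0.02, "low").

```python
def injection_risk(payload):
    """Return risk_score from ai_safety, or None when the object is absent."""
    safety = payload.get("ai_safety")
    if safety is None:
        return None
    return safety["risk_score"]


def accept_for_llm(payload, max_score=0.5):
    """Accept content only when an assessment exists and is under the threshold.

    Assumption: higher risk_score means higher risk. The threshold is a
    caller-chosen value, not one documented by the API.
    """
    score = injection_risk(payload)
    # Conservative choice: content without an assessment is not accepted.
    return score is not None and score <= max_score


example = {"ai_safety": {"risk_score": 0.02, "risk_level": "low", "signals": {}}}
print(accept_for_llm(example))
```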

Errors

| Status | Code | Condition |
|---|---|---|
| 400 | | Missing or invalid site URL |
| 400 | -2233 | Plan does not support the requested feature (premium proxy, retry, etc.) |
| 422 | -4001 | ai_sanitize_mode=block and high injection risk was detected |
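A client can map these conditions to a single exception type. The Python sketch below takes the HTTP status and error code as arguments, because the shape of the error body is not documented on this page. ExtractError and check_status are hypothetical names, and the codes are copied from the table as written.

```python
# Conditions keyed by (HTTP status, error code); None means no code is listed.
KNOWN_ERRORS = {
    (400, None): "Missing or invalid site URL",
    (400, -2233): "Plan does not support the requested feature",
    (422, -4001): "ai_sanitize_mode=block and high injection risk was detected",
}


class ExtractError(Exception):
    """Raised for any 4xx/5xx status returned by the extract endpoint."""

    def __init__(self, status, code=None):
        self.status = status
        self.code = code
        message = KNOWN_ERRORS.get((status, code), "Unrecognized error")
        super().__init__(f"HTTP {status}: {message}")


def check_status(status, code=None):
    """Raise ExtractError for error statuses; do nothing otherwise."""
    if status >= 400:
        raise ExtractError(status, code)


try:
    check_status(422, -4001)
except ExtractError as error:
    print(error)
```

Combinations not listed in the table fall through to a generic message instead of being guessed at.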

Use Cases

  • AI/LLM data pipelines – feed clean text to language models
  • Content analysis and summarization
  • SEO content auditing – check heading structure
  • Research and data collection
  • Automated reporting

MCP Tool

This endpoint is available as the Extract Content tool in the OpenGraph MCP Server. Your AI assistant can extract elements directly without writing any code.
