Content Extraction API
Extract specific HTML elements (titles, headers, paragraphs) in a structured, LLM-ready format. Feed clean web data directly into RAG pipelines, content analysis tools, or AI applications without running your own scraper infrastructure.
v3.0 enables smart defaults — auto_proxy, auto_render, and retry are all on by default. The best proxy and rendering strategy is chosen automatically for each target domain.
Endpoint
POST https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID
Content-Type: application/json
Parameters
Body Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| site | string | Yes | — | The URL to scrape |
| selectors | object | No | — | CSS selector configurations for structured data extraction |
| html_elements | string | No | title,h1,h2,h3,h4,h5,p | Comma-separated HTML tags used to generate concatenatedText |
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| cache_ok | boolean | true | Allow cached results |
| max_cache_age | number | — | Max cache age in seconds |
| full_render | boolean | false | Use headless browser rendering |
| use_proxy | boolean | false | Route request through proxy |
| use_premium | boolean | false | Use premium proxy (requires plan support) |
| use_superior | boolean | false | Use superior proxy (requires plan support) |
| auto_proxy | boolean | true | Automatically select the best proxy for the target domain |
| auto_render | boolean | true | Automatically use headless rendering when beneficial |
| retry | boolean | true | Retry with proxy escalation on failure (requires plan support) |
| max_retries | number | 3 | Max retry attempts (1–4) |
| retry_escalate | boolean | true | Escalate proxy level on each retry |
| ai_sanitize | boolean | false | Enable prompt injection detection |
| ai_sanitize_mode | string | sanitize | One of: sanitize, warn, block |
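The parameters above could be assembled into a request along the following lines. This is a minimal sketch using only the Python standard library, not an official client: the helper name `build_extract_request` is hypothetical, and sending booleans as lowercase `true`/`false` in the query string is an assumption. The request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://opengraph.io/api/3.0/extract"


def build_extract_request(app_id, site, selectors=None, html_elements=None, **query):
    """Build (but do not send) a POST request for the extract endpoint.

    Hypothetical helper: extra keyword arguments become query parameters,
    for example full_render=True or max_retries=2.
    """
    params = {"app_id": app_id}
    for name, value in query.items():
        # Assumption: booleans are encoded as lowercase "true"/"false".
        params[name] = str(value).lower() if isinstance(value, bool) else value

    body = {"site": site}
    if selectors is not None:
        body["selectors"] = selectors
    if html_elements is not None:
        # html_elements is a comma-separated string, e.g. "title,h1,p".
        body["html_elements"] = ",".join(html_elements)

    return Request(
        f"{ENDPOINT}?{urlencode(params)}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_extract_request(
    "YOUR_APP_ID",
    "https://example.com",
    html_elements=["title", "h1", "p"],
    cache_ok=False,
    max_retries=2,
)
```

Passing `req` to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the URL and body easy to inspect first.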
Selectors Format
The selectors object maps custom keys to CSS selector configurations:
{
"selectors": {
"pageTitle": {
"selector": "h1.title",
"type": "text"
},
"navLinks": {
"selector": "a.nav-link",
"multiple": true,
"type": "attr",
"attr": "href"
},
"firstParagraph": {
"selector": "article p",
"type": "text"
}
}
}
Selector Config Options
| Property | Type | Default | Description |
|---|---|---|---|
| selector | string | — | Any valid CSS selector |
| multiple | boolean | false | true returns all matches as an array; false returns only the first match |
| type | string | text | text extracts inner text; attr extracts an HTML attribute value |
| attr | string | — | The attribute to extract (required when type is attr) |
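The rules in this table can be checked client-side before a request is sent. The sketch below is one possible approach, with a hypothetical helper name (`validate_selectors`). Rejecting any `type` other than `text` or `attr` is an assumption based on the two values listed above.

```python
def validate_selectors(selectors):
    """Return a list of problems found in a selectors object (empty if none)."""
    problems = []
    for key, config in selectors.items():
        if not config.get("selector"):
            problems.append(f"{key}: missing 'selector'")
        kind = config.get("type", "text")  # type defaults to "text"
        if kind not in ("text", "attr"):
            # Assumption: only the two documented types are accepted.
            problems.append(f"{key}: unknown type {kind!r}")
        elif kind == "attr" and not config.get("attr"):
            problems.append(f"{key}: 'attr' is required when type is 'attr'")
        if not isinstance(config.get("multiple", False), bool):
            problems.append(f"{key}: 'multiple' must be a boolean")
    return problems


problems = validate_selectors({
    "pageTitle": {"selector": "h1.title", "type": "text"},
    # Missing "attr", so this entry is reported.
    "navLinks": {"selector": "a.nav-link", "multiple": True, "type": "attr"},
})
```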
Example Request
curl -X POST "https://opengraph.io/api/3.0/extract?app_id=YOUR_APP_ID" \
-H "Content-Type: application/json" \
-d '{
"site": "https://example.com",
"selectors": {
"heading": {
"selector": "h1",
"type": "text"
},
"allLinks": {
"selector": "a",
"multiple": true,
"type": "attr",
"attr": "href"
}
}
}'
Example Response
{
"url": "https://example.com",
"concatenatedText": "Example Domain This domain is for use in illustrative examples...",
"data": {
"heading": "Example Domain",
"allLinks": ["https://www.iana.org/domains/example"]
}
}
Response Fields
| Field | Presence | Description |
|---|---|---|
| url | Always | The URL that was requested |
| concatenatedText | Always | Plain text extracted from the specified (or default) html_elements tags, concatenated into a single string |
| data | Only when selectors provided | An object containing extraction results keyed by your selector names |
| ai_safety | Only when ai_sanitize enabled | Prompt injection risk assessment |
LLM Tip: Use concatenatedText when feeding content to AI models for summarization. It provides clean text without HTML markup.
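Because `data` and `ai_safety` are only present in some responses, client code needs to treat them as optional. A minimal sketch of parsing a response body with that in mind is shown here; the `ExtractResult` class and `parse_extract_response` function are hypothetical names, not part of the API.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractResult:
    url: str
    concatenated_text: str
    data: dict = field(default_factory=dict)  # empty when no selectors were sent
    ai_safety: Optional[dict] = None          # None when ai_sanitize is off


def parse_extract_response(raw):
    """Parse a JSON response body into an ExtractResult."""
    payload = json.loads(raw)
    return ExtractResult(
        url=payload["url"],
        concatenated_text=payload["concatenatedText"],
        data=payload.get("data", {}),
        ai_safety=payload.get("ai_safety"),
    )


raw = """{
  "url": "https://example.com",
  "concatenatedText": "Example Domain This domain is for use in illustrative examples...",
  "data": {
    "heading": "Example Domain",
    "allLinks": ["https://www.iana.org/domains/example"]
  }
}"""
result = parse_extract_response(raw)
```

`result.concatenated_text` is the value to pass on to a model for summarization.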
AI Safety
When ai_sanitize is enabled, the response includes an ai_safety object with prompt injection risk assessment:
{
"ai_safety": {
"risk_score": 0.02,
"risk_level": "low",
"signals": {}
}
}
Use ai_sanitize_mode to control behavior: sanitize strips detected injections, warn adds flags but keeps content, and block rejects high-risk responses with a 422 error.
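In warn mode the content is kept, so the caller decides whether to forward it to a model. One conservative way to make that decision is sketched below. The function name `safe_for_llm` is hypothetical, and since only the `low` risk level appears in this documentation, the sketch allows that level and treats every other value as unsafe rather than guessing at level names.

```python
def safe_for_llm(response, allowed_levels=("low",)):
    """Return True only if the response carries an ai_safety assessment
    whose risk_level is one of allowed_levels."""
    assessment = response.get("ai_safety")
    if assessment is None:
        # ai_sanitize was not enabled, so there is no assessment to rely on.
        return False
    return assessment.get("risk_level") in allowed_levels


response = {
    "ai_safety": {
        "risk_score": 0.02,
        "risk_level": "low",
        "signals": {},
    }
}
ok_to_forward = safe_for_llm(response)
```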
Errors
| Status | Code | Condition |
|---|---|---|
| 400 | — | Missing or invalid site URL |
| 400 | -2233 | Plan does not support the requested feature (premium proxy, retry, etc.) |
| 422 | -4001 | ai_sanitize_mode=block and high injection risk was detected |
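The table above can be mirrored in a small lookup so client code reports a readable reason for each failure. This is a sketch with hypothetical names (`KNOWN_ERRORS`, `describe_error`). How the numeric code is carried in the error response is not specified here, so the function takes the status and code as plain arguments and leaves extracting them to the caller.

```python
# (HTTP status, error code) pairs from the Errors table.
# None stands for the row that lists no code.
KNOWN_ERRORS = {
    (400, None): "Missing or invalid site URL",
    (400, -2233): "Plan does not support the requested feature",
    (422, -4001): "ai_sanitize_mode=block and high injection risk was detected",
}


def describe_error(status, code=None):
    """Map a status/code pair to the documented condition, if there is one."""
    return KNOWN_ERRORS.get(
        (status, code),
        f"Unrecognized error (status {status}, code {code})",
    )


message = describe_error(400, -2233)
```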
Use Cases
- AI/LLM data pipelines – feed clean text to language models
- Content analysis and summarization
- SEO content auditing – check heading structure
- Research and data collection
- Automated reporting
MCP Tool
This endpoint is available as the Extract Content tool in the OpenGraph MCP Server. Your AI assistant can extract elements directly without writing any code.