POST /v1/sync/scrape
Scrapes a URL and returns the result synchronously.
Request
curl -X POST \
'https://api.scrapingpros.com/v1/sync/scrape' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"browser": true
}'
Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL to scrape |
| browser | boolean | No | Use a headless browser (default: false) |
| browser_type | string | No | Browser engine when browser: true: "heavy" (camoufox, anti-detection, default) or "light" (pure Playwright, lighter) |
| use_proxy | string \| object | No | Proxy configuration ("any", "US", or an object with retry settings) |
| screenshot | boolean | No | Capture a screenshot (default: false) |
| language | string | No | Browser language (e.g. "en-us") |
| headers | object | No | Custom HTTP headers (without browser only) |
| cookies | object | No | Cookies to add before scraping |
| actions | array | No | Actions to execute in the browser (requires browser: true) |
| extract | object | No | Selectors for data extraction (requires browser: true) |
| http_method | object | No | HTTP method to use, GET or POST, with an optional payload (without browser only) |
| window_size | string | No | Browser window size (e.g. "1920,1080") |
| network_capture | object | No | Capture network requests during scraping (requires browser: true) |
| count_for_tracking | boolean | No | Count this request toward project usage tracking (default: false) |
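The parameter dependencies above (actions, extract, network_capture, and browser_type all require browser: true) can be checked client-side before sending a request. The sketch below is illustrative only; the build_scrape_payload helper is ours, not part of the API.

```python
def build_scrape_payload(url, browser=False, **options):
    """Assemble a /v1/sync/scrape request body, enforcing the
    browser-only parameter dependencies documented above."""
    browser_only = {"actions", "extract", "network_capture", "browser_type"}
    payload = {"url": url}
    if browser:
        payload["browser"] = True
    for key, value in options.items():
        if key in browser_only and not browser:
            raise ValueError(f"{key} requires browser: true")
        payload[key] = value
    return payload

# A browser request with extraction passes validation:
payload = build_scrape_payload(
    "https://example.com", browser=True, extract={"title": "css:h1"}
)
print(payload)
```

Passing, say, actions without browser: true raises immediately instead of failing server-side.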
network_capture Object
| Field | Type | Description |
|---|---|---|
| resource_types | array \| null | Filter by resource type. Valid values: document, stylesheet, image, media, font, script, xhr, fetch, eventsource, websocket, manifest, other. null = capture all. |
| listen_after_load_ms | integer \| null | Extra milliseconds to listen after the page loads (maximum 10,000 ms). Useful for capturing background requests. |
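Both constraints above (the valid resource_types values and the 10,000 ms cap) are easy to enforce before sending the request. This is a client-side sketch; the normalize_network_capture helper name is ours.

```python
VALID_RESOURCE_TYPES = {
    "document", "stylesheet", "image", "media", "font", "script",
    "xhr", "fetch", "eventsource", "websocket", "manifest", "other",
}

def normalize_network_capture(resource_types=None, listen_after_load_ms=None):
    """Build a network_capture object, rejecting unknown resource types
    and clamping the listen window to the documented 10,000 ms maximum."""
    if resource_types is not None:
        unknown = set(resource_types) - VALID_RESOURCE_TYPES
        if unknown:
            raise ValueError(f"unknown resource types: {sorted(unknown)}")
    capture = {"resource_types": resource_types}  # None means capture all
    if listen_after_load_ms is not None:
        capture["listen_after_load_ms"] = min(int(listen_after_load_ms), 10_000)
    return capture
```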
Response
{
"html": "<html>...</html>",
"statusCode": 200,
"message": "Request completed successfully",
"executionTime": 3.45,
"screenshot": null,
"extracted_data": null,
"evaluate_results": null,
"network_requests": null,
"potentiallyBlockedByCaptcha": false
}
| Field | Type | Description |
|---|---|---|
| html | string | Page HTML |
| statusCode | integer | HTTP code of the scraped page |
| message | string | Result message |
| executionTime | float | Execution time in seconds |
| screenshot | string \| null | Base64-encoded screenshot (if requested) |
| extracted_data | object \| null | Extracted data (if extract was used) |
| evaluate_results | array \| null | Results of executed evaluate actions, in order (one per evaluate) |
| network_requests | array \| null | List of captured network requests (if network_capture was used) |
| potentiallyBlockedByCaptcha | boolean \| null | true if the response appears to be a block or captcha page |
Structure of each entry in network_requests
| Field | Type | Description |
|---|---|---|
| url | string | Request URL |
| method | string | HTTP method (GET, POST, etc.) |
| resource_type | string | Resource type (xhr, fetch, document, etc.) |
| status | integer \| null | HTTP response code |
| content_type | string \| null | Response Content-Type (without parameters, e.g. application/json) |
Examples
Simple scraping (without browser)
{
"url": "https://example.com"
}
With browser and proxy
{
"url": "https://example.com",
"browser": true,
"use_proxy": "any"
}
With proxy and retries
{
"url": "https://example.com",
"use_proxy": {
"proxy": "US",
"max_retries": 3,
"delay_seconds": 1,
"backoff_factor": 2
}
}
With custom headers
{
"url": "https://api.example.com/data",
"headers": {
"ocp-apim-subscription-key": "abc123",
"Accept": "application/json"
}
}
With actions (navigate to login)
{
"url": "https://quotes.toscrape.com",
"browser": true,
"actions": [
{
"type": "click",
"selector": "css:a[href='/login']"
},
{
"type": "wait-for-selector",
"selector": "css:#username",
"time": 5000
}
]
}
With data extraction
{
"url": "https://example.com/products",
"browser": true,
"extract": {
"title": "css:h1",
"prices": {
"selector": "css:.price",
"multiple": true
},
"main_image": {
"selector": "css:img.product-image",
"attribute": "src"
}
}
}
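With a request like the one above, extracted_data comes back as raw strings, so prices usually need a post-processing step. The sample data below is hypothetical; the actual values depend on the page being scraped.

```python
import re

# Hypothetical extracted_data for the extraction request above.
extracted_data = {
    "title": "Product catalog",
    "prices": ["$19.99", "$5.00", "$120.50"],
    "main_image": "https://example.com/img/product.png",
}

def parse_price(text):
    """Pull the first decimal number out of a price string."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

prices = [parse_price(p) for p in extracted_data["prices"]]
print(prices)  # -> [19.99, 5.0, 120.5]
```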
Loop: paginate and collect data
{
"url": "https://example.com/products",
"browser": true,
"actions": [
{
"type": "while",
"condition": {
"type": "selector-visible",
"selector": "css:button.next-page"
},
"actions": [
{
"type": "collect",
"extract": {
"product_names": {
"selector": "css:.product-name",
"multiple": true
}
}
},
{
"type": "click",
"selector": "css:button.next-page"
},
{
"type": "wait-for-timeout",
"time": 2000
}
],
"max_iterations": 10
}
]
}
Execute JavaScript on the page (evaluate)
The evaluate action executes arbitrary JavaScript in the page context. Results accumulate, in order, in the evaluate_results array of the response.
{
"url": "https://example.com",
"browser": true,
"actions": [
{
"type": "evaluate",
"script": "document.title"
},
{
"type": "evaluate",
"script": "Array.from(document.querySelectorAll('a')).map(a => a.href)",
"timeout": 15000
}
]
}
Response:
{
"html": "...",
"evaluate_results": [
"Example Domain",
["https://www.iana.org/domains/reserved"]
]
}
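Because evaluate_results holds exactly one entry per evaluate action, in order, results can be matched back to the scripts that produced them. A minimal sketch, using the request and response shown above:

```python
actions = [
    {"type": "evaluate", "script": "document.title"},
    {"type": "evaluate",
     "script": "Array.from(document.querySelectorAll('a')).map(a => a.href)"},
]
response = {
    "html": "...",
    "evaluate_results": [
        "Example Domain",
        ["https://www.iana.org/domains/reserved"],
    ],
}

# Pair each evaluate action with its result, preserving order.
scripts = [a["script"] for a in actions if a["type"] == "evaluate"]
by_script = dict(zip(scripts, response["evaluate_results"]))
print(by_script["document.title"])  # -> Example Domain
```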
Register a request for project tracking
By default, requests are not counted toward tracking, so you can explore and test without polluting the usage estimates. Once the scrape is validated, enable the flag so it counts:
{
"url": "https://example.com/products",
"browser": true,
"count_for_tracking": true
}
The accumulated totals can be queried via /v1/sync/metrics?metric=tracking, which shows, per client, the total counted requests and the breakdown by type (simple vs. browser).
Lightweight browser
The browser_type parameter selects between two engines when browser: true:
| browser_type | Engine | Anti-detection | When to use |
|---|---|---|---|
| "heavy" (default) | camoufox + Playwright | Yes (fingerprint spoofing, humanized mouse, geoip) | Sites with CAPTCHA, Cloudflare, or active bot detection |
| "light" | Pure Playwright (headless Chromium) | No | Sites with dynamic JS but no aggressive anti-bot |
When to use "light"
- The site requires JS to render content, but has no bot detection
- You want to minimize response time and resource usage
- Scraping SPAs, internal dashboards, or APIs that return JSON via XHR
When to use "heavy" (or not specify anything)
- The site returns CAPTCHA or blocks with a simple browser
- Sites with Cloudflare Bot Management, Akamai, or PerimeterX
- Any case where human behavior is necessary to pass detection
{
"url": "https://example.com",
"browser": true,
"browser_type": "light"
}
Practical comparison
// Option 1: Without browser (fastest, no JS)
{
"url": "https://example.com/static-page"
}
// Option 2: Lightweight browser (JS + no anti-detection)
{
"url": "https://example.com/spa-app",
"browser": true,
"browser_type": "light"
}
// Option 3: Heavy browser (JS + full anti-detection, default)
{
"url": "https://example.com/protected-page",
"browser": true
}
// Equivalent to the above -- browser_type "heavy" is the default
{
"url": "https://example.com/protected-page",
"browser": true,
"browser_type": "heavy"
}
Lightweight browser with proxy and actions
browser_type: "light" supports the same features as "heavy": proxies, cookies, headers, actions, extract, screenshot, and network_capture.
{
"url": "https://app.example.com/dashboard",
"browser": true,
"browser_type": "light",
"use_proxy": "any",
"actions": [
{
"type": "wait-for-selector",
"selector": "css:#data-table",
"time": 5000
}
],
"extract": {
"rows": {
"selector": "css:#data-table tr",
"multiple": true
}
}
}
Lightweight browser with network capture
{
"url": "https://app.example.com/products",
"browser": true,
"browser_type": "light",
"network_capture": {
"resource_types": ["xhr", "fetch"]
}
}
POST request with payload
{
"url": "https://api.example.com/search",
"http_method": {
"method": "post",
"payload": {
"query": "shoes",
"category": "fashion",
"page": 1
}
}
}
POST /v1/sync/download
Downloads a file from a URL (PDF, image, etc.) and returns its content encoded in base64 along with the detected content type.
Request
curl -X POST \
'https://api.scrapingpros.com/v1/sync/download' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/document.pdf"
}'
Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| url | string | Yes | URL of the file to download |
| use_proxy | string \| object | No | Proxy configuration ("any", "US", or an object with retry settings) |
Response
{
"content": "JVBERi0xLjQK...",
"contentType": "application/pdf",
"statusCode": 200,
"message": "OK",
"executionTime": 0.312
}
| Field | Type | Description |
|---|---|---|
| content | string \| null | File content encoded in base64 |
| contentType | string \| null | MIME type of the file (e.g. application/pdf, image/png) |
| statusCode | integer | HTTP response code from the server |
| message | string | Result or error message |
| executionTime | float \| null | Execution time in seconds |
Examples
Download a PDF
{
"url": "https://example.com/document.pdf"
}
Download an image with proxy
{
"url": "https://example.com/image.jpg",
"use_proxy": "any"
}
With country-specific proxy and retries
{
"url": "https://example.com/report.pdf",
"use_proxy": {
"proxy": "US",
"max_retries": 3,
"delay_seconds": 1,
"backoff_factor": 2
}
}
Decoding the content
The content field is in base64. To get the original file:
import base64, requests

response = requests.post(
    "https://api.scrapingpros.com/v1/sync/download",
    headers={"Authorization": "Bearer <API-KEY>"},
    json={"url": "https://example.com/document.pdf"},
).json()

# The content field is base64; decode it back into the original bytes.
file_bytes = base64.b64decode(response["content"])
with open("document.pdf", "wb") as f:
    f.write(file_bytes)

Or with bash:

echo "<CONTENT>" | base64 --decode > document.pdf
GET /v1/sync/metrics
Gets API usage metrics.
Parameters (query string)
| Parameter | Type | Description |
|---|---|---|
| date | string | Date range: YYYY-MM-DD:YYYY-MM-DD |
| metric | string | Metric type: url, proxy, api_codes, page_codes, exe_time, scrape_type, tracking |
Example
curl 'https://api.scrapingpros.com/v1/sync/metrics?date=2026-03-01:2026-03-25&metric=scrape_type' \
-H 'Authorization: Bearer <API-KEY>'
Response
{
"date": "2026-03-01:2026-03-25",
"scrape_type": {
"browser": {
"total": 1500,
"success": 1420,
"failed": 80,
"success_rate": 94.67,
"percentage_of_total": 60.0
},
"simple": {
"total": 1000,
"success": 980,
"failed": 20,
"success_rate": 98.0,
"percentage_of_total": 40.0
},
"total_requests": 2500
}
}
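The derived fields in this response follow directly from the raw counts: success_rate is success over total, and percentage_of_total is each type's total over total_requests, both as percentages. A quick consistency check, assuming rounding to two decimals as shown above:

```python
# Raw counts from the example response above.
scrape_type = {
    "browser": {"total": 1500, "success": 1420, "failed": 80},
    "simple": {"total": 1000, "success": 980, "failed": 20},
}
total_requests = sum(v["total"] for v in scrape_type.values())

# Recompute the derived percentages from the raw counts.
for stats in scrape_type.values():
    stats["success_rate"] = round(100 * stats["success"] / stats["total"], 2)
    stats["percentage_of_total"] = round(100 * stats["total"] / total_requests, 2)

print(scrape_type["browser"]["success_rate"])        # -> 94.67
print(scrape_type["simple"]["percentage_of_total"])  # -> 40.0
```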
Project usage tracking
curl 'https://api.scrapingpros.com/v1/sync/metrics?metric=tracking' \
-H 'Authorization: Bearer <API-KEY>'
{
"date": "2026-03-30",
"tracking": {
"cliente-lupa": {
"total": 1200,
"browser": 800,
"simple": 400
},
"cliente-ocula": {
"total": 300,
"browser": 0,
"simple": 300
}
}
}
The counters are cumulative (not reset by date) and only increment when the request includes "count_for_tracking": true.
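Since the tracking response is keyed by client, cross-client totals have to be aggregated yourself. A minimal sketch, using the sample response above:

```python
# Sample tracking object from the response above.
tracking = {
    "cliente-lupa": {"total": 1200, "browser": 800, "simple": 400},
    "cliente-ocula": {"total": 300, "browser": 0, "simple": 300},
}

# Overall totals across clients, split by request type.
overall = {"total": 0, "browser": 0, "simple": 0}
for counts in tracking.values():
    for key in overall:
        overall[key] += counts[key]

print(overall)  # -> {'total': 1500, 'browser': 800, 'simple': 700}
```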
Network request capture (discover internal APIs)
{
"url": "https://example.com",
"browser": true,
"network_capture": {
"resource_types": ["xhr", "fetch"]
}
}
Response with network_requests:
{
"html": "...",
"statusCode": 200,
"network_requests": [
{
"url": "https://api.example.com/v2/products?page=1",
"method": "GET",
"resource_type": "fetch",
"status": 200,
"content_type": "application/json"
},
{
"url": "https://api.example.com/v2/user/me",
"method": "GET",
"resource_type": "xhr",
"status": 401,
"content_type": "application/json"
}
]
}
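A common next step is to sift the captured requests for internal JSON APIs worth calling directly. A sketch over the sample response above:

```python
# Sample network_requests from the response above.
network_requests = [
    {"url": "https://api.example.com/v2/products?page=1", "method": "GET",
     "resource_type": "fetch", "status": 200, "content_type": "application/json"},
    {"url": "https://api.example.com/v2/user/me", "method": "GET",
     "resource_type": "xhr", "status": 401, "content_type": "application/json"},
]

# Keep only successful JSON responses; these are the endpoints most
# likely to be usable as direct data sources.
json_apis = [
    r["url"] for r in network_requests
    if r.get("content_type") == "application/json"
    and r.get("status") is not None and 200 <= r["status"] < 300
]
print(json_apis)  # -> ['https://api.example.com/v2/products?page=1']
```

The status check guards against null (requests that never got a response), since the field table above allows it.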
Listen for post-load requests (background polling)
{
"url": "https://example.com/dashboard",
"browser": true,
"network_capture": {
"resource_types": ["xhr", "fetch"],
"listen_after_load_ms": 3000
}
}
GET /v1/sync/client-metrics
Gets per-client usage metrics for the authenticated client. Administrators can query metrics for any client.
Parameters (query string)
| Parameter | Type | Description |
|---|---|---|
| date | string | YYYY-MM-DD (default: today), YYYY-MM-DD:YYYY-MM-DD (range), or YYYY-MM (full month) |
| client | string | (Admin only) Filter by a specific client_id |
| hourly | boolean | If true, includes an hourly breakdown (single day only) |
| detail | string | If "urls", includes a per-domain breakdown with browser_success, browser_failed, simple_success, simple_failed |
Example
curl 'https://api.scrapingpros.com/v1/sync/client-metrics?date=2026-03-25&hourly=true' \
-H 'Authorization: Bearer <API-KEY>'
Example with per-domain breakdown
curl 'https://api.scrapingpros.com/v1/sync/client-metrics?date=2026-03&detail=urls' \
-H 'Authorization: Bearer <API-KEY>'
GET /v1/sync/billing
Gets a monthly billing summary per client, calculated from MySQL (precise data, not a Redis approximation).
Parameters (query string)
| Parameter | Type | Description |
|---|---|---|
| month | string | Month in YYYY-MM format (default: current month) |
| client | string | (Admin only) Filter by a specific client_id |
| detail | string | If "urls", includes a per-domain breakdown (by_url) |
Request
curl 'https://api.scrapingpros.com/v1/sync/billing?month=2026-03' \
-H 'Authorization: Bearer <API-KEY>'
Response
{
"month": "2026-03",
"clients": {
"my-client": {
"simple_success": 15000,
"simple_failed": 200,
"simple_total": 15200,
"browser_success": 8000,
"browser_failed": 150,
"browser_total": 8150,
"total_requests": 23350,
"total_success": 23000,
"total_failed": 350
}
}
}
Response with detail=urls
{
"month": "2026-03",
"clients": {
"my-client": {
"simple_success": 15000,
"simple_failed": 200,
"simple_total": 15200,
"browser_success": 8000,
"browser_failed": 150,
"browser_total": 8150,
"total_requests": 23350,
"total_success": 23000,
"total_failed": 350,
"by_url": {
"example.com": {
"simple_success": 10000,
"simple_failed": 100,
"browser_success": 5000,
"browser_failed": 50,
"total": 15150
}
}
}
}
}
| Field | Type | Description |
|---|---|---|
| simple_success | integer | Successful requests without browser |
| simple_failed | integer | Failed requests without browser |
| simple_total | integer | Total requests without browser |
| browser_success | integer | Successful requests with browser |
| browser_failed | integer | Failed requests with browser |
| browser_total | integer | Total requests with browser |
| total_requests | integer | Total of all requests |
| total_success | integer | Total successful requests |
| total_failed | integer | Total failed requests |
| by_url | object \| null | Per-domain breakdown (only with detail=urls) |
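The fields above are arithmetically related: each _total is success plus failed, and the grand totals sum the simple and browser sides. A sketch that verifies those invariants on the example response (useful when reconciling billing data downstream):

```python
def check_billing_invariants(client):
    """Verify the arithmetic relations implied by the field table above."""
    assert client["simple_total"] == client["simple_success"] + client["simple_failed"]
    assert client["browser_total"] == client["browser_success"] + client["browser_failed"]
    assert client["total_requests"] == client["simple_total"] + client["browser_total"]
    assert client["total_success"] == client["simple_success"] + client["browser_success"]
    assert client["total_failed"] == client["simple_failed"] + client["browser_failed"]

# The example response above satisfies all five relations.
check_billing_invariants({
    "simple_success": 15000, "simple_failed": 200, "simple_total": 15200,
    "browser_success": 8000, "browser_failed": 150, "browser_total": 8150,
    "total_requests": 23350, "total_success": 23000, "total_failed": 350,
})
```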
GET /v1/proxy/countries
Lists available countries for geographic proxies.
Request
curl 'https://api.scrapingpros.com/v1/proxy/countries' \
-H 'Authorization: Bearer <API-KEY>'
Response
{
"countries": ["US", "GB", "MX", "BR", "AR", "DE", "FR", "ES"]
}
| Field | Type | Description |
|---|---|---|
| countries | array | List of available ISO 3166-1 alpha-2 country codes |
| error | string \| null | Error message if the proxy service could not be queried |
POST /v1/proxy/request-country
Requests access to proxies from a specific country. The request remains pending until an administrator approves it.
Request
curl -X POST \
'https://api.scrapingpros.com/v1/proxy/request-country' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"country_code": "US",
"reason": "We need to scrape prices on Amazon US"
}'
Body Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| country_code | string | Yes | ISO 3166-1 alpha-2 country code (e.g. "US", "MX") |
| reason | string | No | Reason for the request |
Response
{
"status": "pending",
"country_code": "US",
"message": "Request submitted. An admin will review and approve your access."
}
| Field | Type | Description |
|---|---|---|
| status | string | Request status: "pending", "already_approved", or "error" |
| country_code | string | Requested country code |
| message | string | Descriptive result message |
GET /v1/proxy/status
Shows the proxy approval status by country for the authenticated client.
Request
curl 'https://api.scrapingpros.com/v1/proxy/status' \
-H 'Authorization: Bearer <API-KEY>'
Response
{
"client_id": "my-client",
"approved_countries": ["US", "GB"],
"pending_countries": ["MX"]
}
| Field | Type | Description |
|---|---|---|
| client_id | string | Authenticated client identifier |
| approved_countries | array | Countries already approved for use as proxy |
| pending_countries | array | Countries with a pending approval request |
GET /v1/health
Health check of all API components. No authentication required.
Request
curl 'https://api.scrapingpros.com/v1/health'
Response
{
"status": "healthy",
"checks": {
"redis": {"status": "ok", "latency_ms": 1.2},
"mysql": {"status": "ok", "latency_ms": 3.5},
"proxies_api": {
"internal": {"status": "ok", "latency_ms": 15.0},
"external": {"status": "ok", "latency_ms": 120.0}
},
"workers": {
"sync": {"up": 50, "expected": 50},
"async": {"up": 6, "expected": 6}
},
"queues": {
"pending_jobs": 0,
"async_scheduler": 0
}
},
"uptime_seconds": 86400
}
| Field | Type | Description |
|---|---|---|
| status | string | Overall status: healthy, degraded, or unhealthy |
| checks.redis | object | Redis status and latency |
| checks.mysql | object | MySQL status and latency |
| checks.proxies_api | object | Internal and external proxy status |
| checks.workers | object | Active vs. expected workers (sync and async) |
| checks.queues | object | Job queue depth |
| uptime_seconds | integer | Seconds since the API started |
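When consuming this endpoint from a monitor, you may want to derive a status locally from the individual checks rather than trust only the top-level field. The aggregation rule below is one plausible reading, not the server's documented algorithm: any failed dependency check means unhealthy, while missing workers or a queue backlog means degraded.

```python
def overall_status(checks):
    """Derive an overall status from a /v1/health checks object.
    Illustrative aggregation only; the server-side rule may differ."""
    statuses = [
        checks["redis"]["status"],
        checks["mysql"]["status"],
        checks["proxies_api"]["internal"]["status"],
        checks["proxies_api"]["external"]["status"],
    ]
    if any(s != "ok" for s in statuses):
        return "unhealthy"
    workers_short = any(w["up"] < w["expected"] for w in checks["workers"].values())
    backlog = any(depth > 0 for depth in checks["queues"].values())
    return "degraded" if workers_short or backlog else "healthy"

# The example response above comes out healthy.
checks = {
    "redis": {"status": "ok", "latency_ms": 1.2},
    "mysql": {"status": "ok", "latency_ms": 3.5},
    "proxies_api": {"internal": {"status": "ok", "latency_ms": 15.0},
                    "external": {"status": "ok", "latency_ms": 120.0}},
    "workers": {"sync": {"up": 50, "expected": 50},
                "async": {"up": 6, "expected": 6}},
    "queues": {"pending_jobs": 0, "async_scheduler": 0},
}
print(overall_status(checks))  # -> healthy
```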