
POST /v1/sync/scrape

Scrapes a URL and returns the result synchronously.

Request

curl -X POST \
'https://api.scrapingpros.com/v1/sync/scrape' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com",
"browser": true
}'

Body Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | URL to scrape |
| browser | boolean | No | Use headless browser (default: false) |
| browser_type | string | No | Browser type when browser: true: "heavy" (camoufox, anti-detection, default) or "light" (pure Playwright, lighter) |
| use_proxy | string \| object | No | Proxy configuration ("any", "US", or object with retries) |
| screenshot | boolean | No | Capture screenshot (default: false) |
| language | string | No | Browser language (e.g. "en-us") |
| headers | object | No | Custom HTTP headers (only without browser) |
| cookies | object | No | Cookies to add before scraping |
| actions | array | No | Actions to execute in the browser (requires browser: true) |
| extract | object | No | Selectors for data extraction (requires browser: true) |
| http_method | object | No | HTTP method: GET or POST (only without browser) |
| window_size | string | No | Browser window size (e.g. "1920,1080") |
| network_capture | object | No | Capture network requests during scraping (requires browser: true) |
| count_for_tracking | boolean | No | Count this request toward project usage tracking (default: false) |
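The curl call above maps directly to any HTTP client. Below is a minimal Python sketch; the build_scrape_payload helper is our own illustration (not part of any SDK), and the live call is shown commented out because it needs a real API key.

```python
def build_scrape_payload(url, **options):
    """Assemble a /v1/sync/scrape body: url is required, everything else optional."""
    payload = {"url": url}
    payload.update(options)  # e.g. browser, use_proxy, extract, ...
    return payload

# The body for the curl example above:
body = build_scrape_payload("https://example.com", browser=True)
print(body)  # {'url': 'https://example.com', 'browser': True}

# Sending it (requires the third-party requests package):
# import requests
# resp = requests.post(
#     "https://api.scrapingpros.com/v1/sync/scrape",
#     headers={"Authorization": "Bearer <API-KEY>"},
#     json=body,
#     timeout=120,  # browser scrapes can take several seconds
# )
# print(resp.json()["statusCode"])
```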

network_capture Object

| Field | Type | Description |
| --- | --- | --- |
| resource_types | array \| null | Filter by resource type. Valid values: document, stylesheet, image, media, font, script, xhr, fetch, eventsource, websocket, manifest, other. null = capture all. |
| listen_after_load_ms | integer \| null | Extra milliseconds to listen after the page loads (maximum 10,000 ms). Useful for capturing background requests. |

Response

{
"html": "<html>...</html>",
"statusCode": 200,
"message": "Request completed successfully",
"executionTime": 3.45,
"screenshot": null,
"extracted_data": null,
"evaluate_results": null,
"network_requests": null,
"potentiallyBlockedByCaptcha": false
}

| Field | Type | Description |
| --- | --- | --- |
| html | string | Page HTML |
| statusCode | integer | HTTP code of the scraped page |
| message | string | Result message |
| executionTime | float | Execution time in seconds |
| screenshot | string \| null | Base64-encoded screenshot (if requested) |
| extracted_data | object \| null | Extracted data (if extract was used) |
| evaluate_results | array \| null | Results of executed evaluate actions, in order (one per evaluate) |
| network_requests | array \| null | Captured network requests (if network_capture was used) |
| potentiallyBlockedByCaptcha | boolean \| null | true if the response appears to be a block or captcha page |

Structure of each entry in network_requests

| Field | Type | Description |
| --- | --- | --- |
| url | string | Request URL |
| method | string | HTTP method (GET, POST, etc.) |
| resource_type | string | Resource type (xhr, fetch, document, etc.) |
| status | integer \| null | HTTP response code |
| content_type | string \| null | Response Content-Type (without parameters, e.g. application/json) |

Examples

Simple scraping (without browser)

{
"url": "https://example.com"
}

With browser and proxy

{
"url": "https://example.com",
"browser": true,
"use_proxy": "any"
}

With proxy and retries

{
"url": "https://example.com",
"use_proxy": {
"proxy": "US",
"max_retries": 3,
"delay_seconds": 1,
"backoff_factor": 2
}
}
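The exact retry timing is applied server-side and is not specified here; a natural reading of these fields is exponential backoff, waiting delay_seconds before the first retry and multiplying by backoff_factor each time. A sketch of that assumption:

```python
def backoff_schedule(max_retries, delay_seconds, backoff_factor):
    """Assumed wait (in seconds) before each retry: delay * factor**attempt."""
    return [delay_seconds * backoff_factor ** attempt for attempt in range(max_retries)]

# With the example configuration above (max_retries=3, delay_seconds=1, backoff_factor=2):
print(backoff_schedule(3, 1, 2))  # [1, 2, 4]
```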

With custom headers

{
"url": "https://api.example.com/data",
"headers": {
"ocp-apim-subscription-key": "abc123",
"Accept": "application/json"
}
}

With actions (navigate to login)

{
"url": "https://quotes.toscrape.com",
"browser": true,
"actions": [
{
"type": "click",
"selector": "css:a[href='/login']"
},
{
"type": "wait-for-selector",
"selector": "css:#username",
"time": 5000
}
]
}

With data extraction

{
"url": "https://example.com/products",
"browser": true,
"extract": {
"title": "css:h1",
"prices": {
"selector": "css:.price",
"multiple": true
},
"main_image": {
"selector": "css:img.product-image",
"attribute": "src"
}
}
}
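As the example shows, each entry in extract is either a bare selector string (shorthand) or an object with selector plus options. A small, hypothetical client-side helper that expands the shorthand; the defaults it assumes (a single match, extracting element text when no attribute is given) are our reading of the example, not a documented contract:

```python
def normalize_extract_spec(spec):
    """Expand bare selector strings into the full object form."""
    normalized = {}
    for name, rule in spec.items():
        if isinstance(rule, str):
            rule = {"selector": rule}
        # Assumed default: a single match unless "multiple" is set explicitly.
        normalized[name] = {"multiple": False, **rule}
    return normalized

spec = {
    "title": "css:h1",
    "prices": {"selector": "css:.price", "multiple": True},
}
print(normalize_extract_spec(spec))
```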

Loop: paginate and collect data

{
"url": "https://example.com/products",
"browser": true,
"actions": [
{
"type": "while",
"condition": {
"type": "selector-visible",
"selector": "css:button.next-page"
},
"actions": [
{
"type": "collect",
"extract": {
"product_names": {
"selector": "css:.product-name",
"multiple": true
}
}
},
{
"type": "click",
"selector": "css:button.next-page"
},
{
"type": "wait-for-timeout",
"time": 2000
}
],
"max_iterations": 10
}
]
}

Execute JavaScript on the page (evaluate)

The evaluate action executes arbitrary JavaScript in the page context. Results accumulate, in action order, in the evaluate_results array of the response.

{
"url": "https://example.com",
"browser": true,
"actions": [
{
"type": "evaluate",
"script": "document.title"
},
{
"type": "evaluate",
"script": "Array.from(document.querySelectorAll('a')).map(a => a.href)",
"timeout": 15000
}
]
}

Response:

{
"html": "...",
"evaluate_results": [
"Example Domain",
["https://www.iana.org/domains/reserved"]
]
}
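Because results come back in the same order as the evaluate actions, they can be re-associated with their scripts client-side. A small sketch of that pairing, using the example above:

```python
def pair_evaluate_results(actions, evaluate_results):
    """Map each evaluate action's script to its result, relying on order."""
    scripts = [a["script"] for a in actions if a.get("type") == "evaluate"]
    return dict(zip(scripts, evaluate_results or []))

actions = [
    {"type": "evaluate", "script": "document.title"},
    {"type": "evaluate",
     "script": "Array.from(document.querySelectorAll('a')).map(a => a.href)"},
]
results = ["Example Domain", ["https://www.iana.org/domains/reserved"]]
print(pair_evaluate_results(actions, results)["document.title"])  # Example Domain
```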

Register a request for project tracking

By default, requests are not counted in tracking. This lets you explore and test without skewing the usage estimates. Once the scrape is validated, enable the flag so the request counts:

{
"url": "https://example.com/products",
"browser": true,
"count_for_tracking": true
}

The accumulated data can be queried from /v1/sync/metrics?metric=tracking and shows, per client, the total counted requests and the breakdown by type (simple vs browser).

Lightweight browser

The browser_type parameter selects between two engines when browser: true:

| browser_type | Engine | Anti-detection | When to use |
| --- | --- | --- | --- |
| "heavy" (default) | camoufox + Playwright | Yes: fingerprint spoofing, humanized mouse, geoip | Sites with CAPTCHA, Cloudflare, or active bot detection |
| "light" | Pure Playwright (headless Chromium) | No | Sites with dynamic JS but no aggressive anti-bot protection |

When to use "light"

  • The site requires JS to render content, but has no bot detection
  • You want to minimize response time and resource usage
  • Scraping SPAs, internal dashboards, or APIs that return JSON via XHR

When to use "heavy" (or not specify anything)

  • The site returns CAPTCHA or blocks with a simple browser
  • Sites with Cloudflare Bot Management, Akamai, or PerimeterX
  • Any case where human behavior is necessary to pass detection

{
"url": "https://example.com",
"browser": true,
"browser_type": "light"
}

Practical comparison

// Option 1: Without browser (fastest, no JS)
{
"url": "https://example.com/static-page"
}

// Option 2: Lightweight browser (JS + no anti-detection)
{
"url": "https://example.com/spa-app",
"browser": true,
"browser_type": "light"
}

// Option 3: Heavy browser (JS + full anti-detection, default)
{
"url": "https://example.com/protected-page",
"browser": true
}

// Equivalent to the above: browser_type "heavy" is the default
{
"url": "https://example.com/protected-page",
"browser": true,
"browser_type": "heavy"
}

Lightweight browser with proxy and actions

browser_type: "light" supports the same feature set as "heavy": proxy, cookies, headers, actions, extract, screenshot, and network_capture.

{
"url": "https://app.example.com/dashboard",
"browser": true,
"browser_type": "light",
"use_proxy": "any",
"actions": [
{
"type": "wait-for-selector",
"selector": "css:#data-table",
"time": 5000
}
],
"extract": {
"rows": {
"selector": "css:#data-table tr",
"multiple": true
}
}
}

Lightweight browser with network capture

{
"url": "https://app.example.com/products",
"browser": true,
"browser_type": "light",
"network_capture": {
"resource_types": ["xhr", "fetch"]
}
}

POST request with payload

{
"url": "https://api.example.com/search",
"http_method": {
"method": "post",
"payload": {
"query": "shoes",
"category": "fashion",
"page": 1
}
}
}

POST /v1/sync/download

Downloads a file from a URL (PDF, image, etc.) and returns its content encoded in base64 along with the detected content type.

Request

curl -X POST \
'https://api.scrapingpros.com/v1/sync/download' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"url": "https://example.com/document.pdf"
}'

Body Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | URL of the file to download |
| use_proxy | string \| object | No | Proxy configuration ("any", "US", or object with retries) |

Response

{
"content": "JVBERi0xLjQK...",
"contentType": "application/pdf",
"statusCode": 200,
"message": "OK",
"executionTime": 0.312
}

| Field | Type | Description |
| --- | --- | --- |
| content | string \| null | File content encoded in base64 |
| contentType | string \| null | MIME type of the file (e.g. application/pdf, image/png) |
| statusCode | integer | HTTP response code from the server |
| message | string | Result or error message |
| executionTime | float \| null | Execution time in seconds |

Examples

Download a PDF

{
"url": "https://example.com/document.pdf"
}

Download an image with proxy

{
"url": "https://example.com/image.jpg",
"use_proxy": "any"
}

With country-specific proxy and retries

{
"url": "https://example.com/report.pdf",
"use_proxy": {
"proxy": "US",
"max_retries": 3,
"delay_seconds": 1,
"backoff_factor": 2
}
}

Decoding the content

The content field is in base64. To get the original file:

import base64
import requests

response = requests.post(
    "https://api.scrapingpros.com/v1/sync/download",
    headers={"Authorization": "Bearer <API-KEY>"},
    json={"url": "https://example.com/document.pdf"},
).json()

file_bytes = base64.b64decode(response["content"])

with open("document.pdf", "wb") as f:
    f.write(file_bytes)

Or with bash:

echo "<CONTENT>" | base64 --decode > document.pdf

GET /v1/sync/metrics

Gets API usage metrics.

Parameters (query string)

| Parameter | Type | Description |
| --- | --- | --- |
| date | string | Date range: YYYY-MM-DD:YYYY-MM-DD |
| metric | string | Metric type: url, proxy, api_codes, page_codes, exe_time, scrape_type, tracking |

Example

curl 'https://api.scrapingpros.com/v1/sync/metrics?date=2026-03-01:2026-03-25&metric=scrape_type' \
-H 'Authorization: Bearer <API-KEY>'

Response

{
"date": "2026-03-01:2026-03-25",
"scrape_type": {
"browser": {
"total": 1500,
"success": 1420,
"failed": 80,
"success_rate": 94.67,
"percentage_of_total": 60.0
},
"simple": {
"total": 1000,
"success": 980,
"failed": 20,
"success_rate": 98.0,
"percentage_of_total": 40.0
},
"total_requests": 2500
}
}
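The derived fields are redundant with the raw counts, which makes the response easy to sanity-check client-side. A quick sketch recomputing success_rate from the sample above (the rounding to two decimals is our assumption, matching the sample values):

```python
def success_rate(bucket):
    """Recompute success_rate (%) for a scrape_type bucket."""
    return round(100 * bucket["success"] / bucket["total"], 2)

browser = {"total": 1500, "success": 1420, "failed": 80}
print(success_rate(browser))  # 94.67, matching the sample response
```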

Project usage tracking

curl 'https://api.scrapingpros.com/v1/sync/metrics?metric=tracking' \
-H 'Authorization: Bearer <API-KEY>'

{
"date": "2026-03-30",
"tracking": {
"cliente-lupa": {
"total": 1200,
"browser": 800,
"simple": 400
},
"cliente-ocula": {
"total": 300,
"browser": 0,
"simple": 300
}
}
}

The counters are cumulative (not reset by date) and only increment when the request includes "count_for_tracking": true.

Network request capture (discover internal APIs)

{
"url": "https://example.com",
"browser": true,
"network_capture": {
"resource_types": ["xhr", "fetch"]
}
}

Response with network_requests:

{
"html": "...",
"statusCode": 200,
"network_requests": [
{
"url": "https://api.example.com/v2/products?page=1",
"method": "GET",
"resource_type": "fetch",
"status": 200,
"content_type": "application/json"
},
{
"url": "https://api.example.com/v2/user/me",
"method": "GET",
"resource_type": "xhr",
"status": 401,
"content_type": "application/json"
}
]
}
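A captured list like this is the starting point for discovering a site's internal JSON APIs. A sketch of a client-side filter; the rule applied here (JSON Content-Type, non-error status) is our own choice, not part of the API:

```python
def json_api_endpoints(network_requests):
    """URLs of captured requests that returned JSON with a non-error status."""
    return [
        r["url"]
        for r in (network_requests or [])
        if r.get("content_type") == "application/json"
        and r.get("status") is not None
        and r["status"] < 400
    ]

# The sample network_requests from the response above:
sample = [
    {"url": "https://api.example.com/v2/products?page=1", "method": "GET",
     "resource_type": "fetch", "status": 200, "content_type": "application/json"},
    {"url": "https://api.example.com/v2/user/me", "method": "GET",
     "resource_type": "xhr", "status": 401, "content_type": "application/json"},
]
print(json_api_endpoints(sample))  # only the 200 OK endpoint survives
```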

Listen for post-load requests (background polling)

{
"url": "https://example.com/dashboard",
"browser": true,
"network_capture": {
"resource_types": ["xhr", "fetch"],
"listen_after_load_ms": 3000
}
}

GET /v1/sync/client-metrics

Gets per-client usage metrics for the authenticated client. Administrators can query metrics for any client.

Parameters (query string)

| Parameter | Type | Description |
| --- | --- | --- |
| date | string | YYYY-MM-DD (default: today), YYYY-MM-DD:YYYY-MM-DD (range), or YYYY-MM (full month) |
| client | string | (Admin only) Filter by a specific client_id |
| hourly | boolean | If true, includes an hourly breakdown (single day only) |
| detail | string | If "urls", includes a per-domain breakdown with browser_success, browser_failed, simple_success, simple_failed |

Example

curl 'https://api.scrapingpros.com/v1/sync/client-metrics?date=2026-03-25&hourly=true' \
-H 'Authorization: Bearer <API-KEY>'

Example with per-domain breakdown

curl 'https://api.scrapingpros.com/v1/sync/client-metrics?date=2026-03&detail=urls' \
-H 'Authorization: Bearer <API-KEY>'

GET /v1/sync/billing

Gets a monthly billing summary per client, calculated from MySQL (precise data, not a Redis approximation).

Parameters (query string)

| Parameter | Type | Description |
| --- | --- | --- |
| month | string | Month in YYYY-MM format (default: current month) |
| client | string | (Admin only) Filter by a specific client_id |
| detail | string | If "urls", includes a per-domain breakdown (by_url) |

Request

curl 'https://api.scrapingpros.com/v1/sync/billing?month=2026-03' \
-H 'Authorization: Bearer <API-KEY>'

Response

{
"month": "2026-03",
"clients": {
"my-client": {
"simple_success": 15000,
"simple_failed": 200,
"simple_total": 15200,
"browser_success": 8000,
"browser_failed": 150,
"browser_total": 8150,
"total_requests": 23350,
"total_success": 23000,
"total_failed": 350
}
}
}

Response with detail=urls

{
"month": "2026-03",
"clients": {
"my-client": {
"simple_success": 15000,
"simple_failed": 200,
"simple_total": 15200,
"browser_success": 8000,
"browser_failed": 150,
"browser_total": 8150,
"total_requests": 23350,
"total_success": 23000,
"total_failed": 350,
"by_url": {
"example.com": {
"simple_success": 10000,
"simple_failed": 100,
"browser_success": 5000,
"browser_failed": 50,
"total": 15150
}
}
}
}
}

| Field | Type | Description |
| --- | --- | --- |
| simple_success | integer | Successful requests without browser |
| simple_failed | integer | Failed requests without browser |
| simple_total | integer | Total requests without browser |
| browser_success | integer | Successful requests with browser |
| browser_failed | integer | Failed requests with browser |
| browser_total | integer | Total requests with browser |
| total_requests | integer | Total of all requests |
| total_success | integer | Total successful requests |
| total_failed | integer | Total failed requests |
| by_url | object \| null | Per-domain breakdown (only with detail=urls) |
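These aggregates are redundant by construction, so a billing response is easy to sanity-check client-side. A sketch using the sample figures above:

```python
def totals_consistent(client):
    """Verify each aggregate field against its component counts."""
    return (
        client["simple_total"] == client["simple_success"] + client["simple_failed"]
        and client["browser_total"] == client["browser_success"] + client["browser_failed"]
        and client["total_requests"] == client["simple_total"] + client["browser_total"]
        and client["total_success"] == client["simple_success"] + client["browser_success"]
        and client["total_failed"] == client["simple_failed"] + client["browser_failed"]
    )

# The "my-client" figures from the sample response:
client = {
    "simple_success": 15000, "simple_failed": 200, "simple_total": 15200,
    "browser_success": 8000, "browser_failed": 150, "browser_total": 8150,
    "total_requests": 23350, "total_success": 23000, "total_failed": 350,
}
print(totals_consistent(client))  # True
```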

GET /v1/proxy/countries

Lists available countries for geographic proxies.

Request

curl 'https://api.scrapingpros.com/v1/proxy/countries' \
-H 'Authorization: Bearer <API-KEY>'

Response

{
"countries": ["US", "GB", "MX", "BR", "AR", "DE", "FR", "ES"]
}

| Field | Type | Description |
| --- | --- | --- |
| countries | array | List of available ISO 3166-1 alpha-2 country codes |
| error | string \| null | Error message if the proxy service could not be queried |

POST /v1/proxy/request-country

Requests access to proxies from a specific country. The request remains pending until an administrator approves it.

Request

curl -X POST \
'https://api.scrapingpros.com/v1/proxy/request-country' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"country_code": "US",
"reason": "We need to scrape prices on Amazon US"
}'

Body Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| country_code | string | Yes | ISO 3166-1 alpha-2 country code (e.g. "US", "MX") |
| reason | string | No | Reason for the request |

Response

{
"status": "pending",
"country_code": "US",
"message": "Request submitted. An admin will review and approve your access."
}

| Field | Type | Description |
| --- | --- | --- |
| status | string | Request status: "pending", "already_approved", or "error" |
| country_code | string | Requested country code |
| message | string | Descriptive result message |

GET /v1/proxy/status

Shows the proxy approval status by country for the authenticated client.

Request

curl 'https://api.scrapingpros.com/v1/proxy/status' \
-H 'Authorization: Bearer <API-KEY>'

Response

{
"client_id": "my-client",
"approved_countries": ["US", "GB"],
"pending_countries": ["MX"]
}

| Field | Type | Description |
| --- | --- | --- |
| client_id | string | Authenticated client identifier |
| approved_countries | array | Countries already approved for use as proxy |
| pending_countries | array | Countries with a pending approval request |

GET /v1/health

Health check of all API components. No authentication required.

Request

curl 'https://api.scrapingpros.com/v1/health'

Response

{
"status": "healthy",
"checks": {
"redis": {"status": "ok", "latency_ms": 1.2},
"mysql": {"status": "ok", "latency_ms": 3.5},
"proxies_api": {
"internal": {"status": "ok", "latency_ms": 15.0},
"external": {"status": "ok", "latency_ms": 120.0}
},
"workers": {
"sync": {"up": 50, "expected": 50},
"async": {"up": 6, "expected": 6}
},
"queues": {
"pending_jobs": 0,
"async_scheduler": 0
}
},
"uptime_seconds": 86400
}

| Field | Type | Description |
| --- | --- | --- |
| status | string | Overall status: healthy, degraded, or unhealthy |
| checks.redis | object | Redis status and latency |
| checks.mysql | object | MySQL status and latency |
| checks.proxies_api | object | Internal and external proxy status |
| checks.workers | object | Active vs. expected workers (sync and async) |
| checks.queues | object | Job queue depth |
| uptime_seconds | integer | Seconds since the API started |
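Since the endpoint needs no authentication, it fits naturally into an external monitor. A minimal sketch that treats the API as up only when the overall status is healthy and every worker pool is at full strength; the full-strength requirement is our own addition, not part of the API contract:

```python
import json
import urllib.request

def is_up(health):
    """Healthy overall status, and every worker pool at its expected count."""
    workers = health.get("checks", {}).get("workers", {})
    pools_ok = all(w.get("up") == w.get("expected") for w in workers.values())
    return health.get("status") == "healthy" and pools_ok

# Live check (stdlib only):
# with urllib.request.urlopen("https://api.scrapingpros.com/v1/health") as r:
#     print(is_up(json.load(r)))

# Against the sample response above:
sample = {
    "status": "healthy",
    "checks": {"workers": {"sync": {"up": 50, "expected": 50},
                           "async": {"up": 6, "expected": 6}}},
}
print(is_up(sample))  # True
```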