Scraping Pros API
Reference documentation for the Scraping Pros API.
Sync Endpoints
POST /v1/sync/scrape
This endpoint extracts the HTML from a page and returns it as soon as it is obtained. The request body supports the following structure (only url is required):
{
"url": "https://example.com",
"browser": true,
"language": "en-us",
"screenshot": true,
"use_proxy": {
"proxy": "any",
"max_retries": 3,
"delay_seconds": 1,
"backoff_factor": 2
},
"headers": {
"X-Custom-Header": "value"
},
"actions": [
{
"type": "input",
"selector": "//input[@name='search']",
"text": "search query"
},
{
"type": "click",
"selector": "css:button[type='submit']",
"wait_for_navigation": true
}
],
"extract": {
"title": "css:h1",
"prices": {
"selector": "css:.price",
"multiple": true
}
},
"cookies": {
"key_1": "value_1",
"key_2": "value_2"
}
}
Response
{
"html": "<html>...</html>",
"markdown": null,
"statusCode": 200,
"message": "Request completed successfully",
"executionTime": 3.45,
"screenshot": "base64...",
"extracted_data": {
"title": "Example Domain",
"prices": ["$10.99", "$24.99", "$5.00"]
},
"evaluate_results": null,
"network_requests": null,
"potentiallyBlockedByCaptcha": false,
"timings": {
"queue_wait_ms": 45,
"proxy_ms": 120,
"browser_launch_ms": 2300,
"navigation_ms": 8500,
"extraction_ms": 150
}
}
- html: The HTML of the scraped page (null when format=markdown).
- markdown: Clean text in markdown format (null when format=html).
- statusCode: HTTP status code of the page.
- message: Message indicating the result of the operation.
- executionTime: Execution time in seconds.
- screenshot: Base64-encoded screenshot (only if screenshot: true).
- extracted_data: Extracted data (only if extract is used).
- evaluate_results: Array with the results of each evaluate action executed, in order (only if evaluate actions were used).
- network_requests: List of captured network requests (only if network_capture was used).
- potentiallyBlockedByCaptcha: Boolean indicating whether the response appears to be a blocking or captcha page.
- timings: Object with detailed timing metrics. Always present, even on errors (with partial values). Useful for performance diagnostics.
- guidance: AI-friendly guidance object, present in every response:
  - success: true only if real content was returned without blocks
  - error_type: classified error (captcha, timeout, proxy_error, ssl_error, dns_error, empty_content, rate_limited, site_error)
  - error_provider: specific CAPTCHA provider if detected (cloudflare, amazon, datadome, perimeterx, akamai, recaptcha, hcaptcha)
  - next_steps: ordered list of what to try next (empty on success)
  - suggested_request: ready-to-use request body for the next attempt (null on success or when retrying won't help)
  - stop_reason: if present, do NOT retry; the issue is permanent (SSL error, DNS failure, 404, all bypass methods exhausted)
- credits_charged / credits_refunded: transparent credit tracking per request
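A minimal Python sketch of acting on the guidance object. The decision rules follow the field descriptions above; the function name and return convention are illustrative, not part of any SDK:

```python
def next_attempt(response_json):
    """Decide what to do after a scrape based on the guidance object.

    Returns ("done", None), ("stop", reason), or ("retry", body), where
    body is the server-suggested request for the next attempt.
    """
    guidance = response_json.get("guidance", {})
    if guidance.get("success"):
        return ("done", None)
    if guidance.get("stop_reason"):
        # Permanent failure (SSL/DNS error, 404, bypass exhausted): never retry.
        return ("stop", guidance["stop_reason"])
    if guidance.get("suggested_request"):
        # The API hands back a ready-to-use body for the next attempt.
        return ("retry", guidance["suggested_request"])
    return ("stop", guidance.get("error_type", "unknown"))


# Example guidance payloads shaped like the response documented above
ok = {"guidance": {"success": True, "next_steps": []}}
blocked = {"guidance": {"success": False, "error_type": "captcha",
                        "suggested_request": {"url": "https://example.com",
                                              "browser": True}}}
```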
Credit headers in each response:
X-Credits-Charged: 5 # credits charged for this request (1 simple, 5 browser)
X-Quota-Used: 1523 # credits used this month
X-Quota-Remaining: 48477 # credits remaining
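For example, the three headers can be read into integers from a requests-style headers dict. parse_credit_headers is a hypothetical helper, not part of any SDK:

```python
def parse_credit_headers(headers):
    """Extract credit accounting from the response headers documented above."""
    return {
        "charged": int(headers.get("X-Credits-Charged", 0)),
        "used": int(headers.get("X-Quota-Used", 0)),
        "remaining": int(headers.get("X-Quota-Remaining", 0)),
    }


# e.g. response.headers from the requests library
usage = parse_credit_headers({"X-Credits-Charged": "5",
                              "X-Quota-Used": "1523",
                              "X-Quota-Remaining": "48477"})
```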
Request Parameters
url (required)
The URL of the site to scrape.
"url": "https://example.com"
browser
Optional field, default value: false. If set to true,
a browser will be used for scraping, which is slower but more reliable
for dynamic sites or sites with JavaScript.
format
Optional field, default value: "html". Possible values: "html" or "markdown".
When "markdown" is used, the response returns clean text in the markdown field (instead of html). Scripts, styles, navigation, footers, and boilerplate are removed. Ideal for AI/LLM consumption and RAG pipelines.
"format": "markdown"
When format=markdown: the markdown field contains the text, html is null.
When format=html (default): the html field contains the HTML, markdown is null.
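A small helper can hide the format switch when consuming responses (illustrative only):

```python
def page_content(data):
    """Return whichever content field is populated: markdown when
    format=markdown was requested, html otherwise (the other is null)."""
    if data.get("markdown") is not None:
        return data["markdown"]
    return data["html"]
```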
retry_on_block
Optional field, default value: false. When enabled (true), the server automatically retries up to 3 times with a different IP/fingerprint when it detects a CAPTCHA or 403 response.
"retry_on_block": true
Only credits for the successful attempt are charged. If all attempts fail, 1 set of credits is charged and the result of the last attempt is returned (with potentiallyBlockedByCaptcha: true).
Early CAPTCHA detection: when the browser detects a CAPTCHA or block on the page, it returns immediately in ~5 seconds (instead of waiting 60-85s until timeout). This applies both with and without retry_on_block enabled. With retry_on_block=true, each retry also benefits from early detection, achieving up to 3 attempts in ~15 seconds total.
screenshot
Optional field, default value: false. When set to true, the scraper will take a screenshot of the page and return it as a base64 encoded string.
use_proxy
This field is optional and can have 2 formats:
- string (Chooses a random proxy)
- Object (Advanced proxy configuration with retry system)
string
If use_proxy is set to "any", the scraper will use a proxy intelligently chosen by the system. If the value is "<country_code>", a proxy from a specific country can be selected.
"use_proxy": "any"
"use_proxy": "MX"
Object with retries
With this format, a retry system can be configured in case a request fails due to a problem with the selected proxy.
- proxy: Same format as the string ("any" or <country_code>).
- max_retries: Maximum number of retries.
- delay_seconds: Initial delay before retrying the scrape.
- backoff_factor: Multiplier applied after each retry.
Example:
"use_proxy": {
"proxy": "US",
"max_retries": 3,
"delay_seconds": 1,
"backoff_factor": 2
}
With the example above, the delay is applied as follows:
| Attempt | Delay |
|---|---|
| 1 | 1s |
| 2 | 2s |
| 3 | 4s |
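The schedule follows delay_seconds * backoff_factor^(attempt - 1), which can be sketched as:

```python
def retry_delays(max_retries, delay_seconds, backoff_factor):
    """Delay (in seconds) applied before each retry attempt,
    reproducing the documented exponential backoff table."""
    return [delay_seconds * backoff_factor ** i for i in range(max_retries)]
```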
Country proxy
The use_proxy parameter accepts an ISO 3166-1 alpha-2 country code (e.g., "US", "MX", "GB") to use a proxy from a specific country. This is useful when a site returns different content based on the visitor's geographic location.
"use_proxy": "US"
Possible values for use_proxy:
| Value | Behavior |
|---|---|
| "any" | Proxy automatically chosen by the system, no country restriction |
| "US", "MX", etc. | Proxy from a specific country (requires prior approval) |
| Field not sent | No proxy, direct request from the server |
Approval flow
To use proxies from a specific country, your account needs prior approval. The flow is:
- Check available countries: GET /v1/proxy/countries to see which countries are available.
- Request access: POST /v1/proxy/request-country with the desired country_code. This creates a pending request.
- Wait for approval: An administrator reviews and approves the request.
- Check status: GET /v1/proxy/status to see your approved and pending countries.
- Use the proxy: Once approved, you can use "use_proxy": "US" (or the approved country code) in your requests.
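Before sending a request with a country proxy, the GET /v1/proxy/status response can be checked locally. A sketch; approved_for is a hypothetical helper:

```python
def approved_for(status_json, country_code):
    """True if the country proxy is already approved, per the
    approved_countries list in the /v1/proxy/status response."""
    return country_code in status_json.get("approved_countries", [])


# Sample /v1/proxy/status response from the documentation
status = {"client_id": "my-client",
          "approved_countries": ["US", "GB"],
          "pending_countries": ["MX"]}
```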
Error if not approved
If you try to use a proxy from a country without having approval, the response will include:
{
"html": "",
"statusCode": 403,
"message": "Country proxy 'US' is not approved for your account.",
"error_type": "country_proxy_not_approved",
"country_code": "US"
}
The error_type field with value "country_proxy_not_approved" allows detecting this case programmatically.
Country Proxy Endpoints
GET /v1/proxy/countries
Returns the list of available countries for proxies.
{
"countries": ["US", "GB", "MX", "BR", "AR", "DE", "FR", "ES"]
}
POST /v1/proxy/request-country
Requests access to proxies from a country. Creates a request that an administrator must approve.
Body:
{
"country_code": "US",
"reason": "We need to scrape prices on Amazon US"
}
Possible responses:
If the request was created (pending approval):
{
"status": "pending",
"country_code": "US",
"message": "Request submitted. An admin will review and approve your access."
}
If the country is already approved for your account:
{
"status": "already_approved",
"country_code": "US",
"message": "Country proxy 'US' is already approved for your account."
}
GET /v1/proxy/status
Shows the status of country proxy approvals for your account.
{
"client_id": "my-client",
"approved_countries": ["US", "GB"],
"pending_countries": ["MX"]
}
headers
Optional field. Allows sending custom HTTP headers with the scraping request. It is a key-value dictionary.
"headers": {
"Accept-Language": "en-US",
"X-Custom-Header": "my-value",
"ocp-apim-subscription-key": "abc123"
}
The headers are applied to the HTTP request made by the worker. Useful when a site requires specific headers to respond correctly (API keys, subscription tokens, etc.).
Note: This field only applies when browser is false (simple HTTP scraping). In browser mode, use the language field to control the language.
cookies
Optional field. Key-value dictionary with cookies to add before scraping. Can be applied both with and without browser.
"cookies": {
"session_id": "abc123",
"consent": "accepted"
}
language
Optional field. String that allows requesting a specific language from the browser when scraping. It is recommended to match the country of the proxy being used.
The format follows the language-region structure (e.g., en-us, es-ar).
In cases where the language is specified in the URL itself, it is recommended to modify the URL instead of using this field. E.g., https://en.wikipedia.org/wiki/Main_Page.
window_size
Optional field. Allows configuring the browser window size. Format: "width,height".
"window_size": "1920,1080"
network_capture
Optional field. Only available when browser is true. Allows capturing network requests made by the page during scraping. Useful for discovering internal APIs, XHR/fetch endpoints, or understanding what resources a site loads.
- resource_types (optional): Array of resource types to capture. Valid values: document, stylesheet, image, media, font, script, xhr, fetch, eventsource, websocket, manifest, other. If null or omitted, captures everything.
- listen_after_load_ms (optional): Extra milliseconds to keep listening for requests after the page finishes loading. Maximum 10000 ms. Useful for capturing requests that fire in the background.
"network_capture": {
"resource_types": ["xhr", "fetch"],
"listen_after_load_ms": 3000
}
The captured requests are returned in the network_requests field of the response. Each entry contains:
- url: URL of the request.
- method: HTTP method (GET, POST, etc.).
- resource_type: Resource type (xhr, fetch, document, etc.).
- status: HTTP status code of the response (can be null if the request did not complete).
- content_type: Content-Type of the response (without parameters, e.g., application/json).
Example response with network_requests:
{
"html": "...",
"statusCode": 200,
"network_requests": [
{
"url": "https://api.example.com/v2/products?page=1",
"method": "GET",
"resource_type": "fetch",
"status": 200,
"content_type": "application/json"
}
]
}
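A common follow-up is to filter the captured requests down to likely internal JSON APIs. A sketch, assuming the response shape shown above:

```python
def json_api_calls(response_json):
    """Return URLs of XHR/fetch requests that answered with JSON;
    these are the most likely internal API endpoints."""
    return [r["url"] for r in response_json.get("network_requests") or []
            if r["resource_type"] in ("xhr", "fetch")
            and r.get("content_type") == "application/json"]


# The sample response from the documentation above
sample = {"html": "...", "statusCode": 200,
          "network_requests": [{"url": "https://api.example.com/v2/products?page=1",
                                "method": "GET", "resource_type": "fetch",
                                "status": 200, "content_type": "application/json"}]}
```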
Actions
actions is an optional field that allows interacting with the page before performing the scraping.
It accepts an array of objects where each one must have a type field.
Can only be used when browser is set to true.
click
Click on an element.
- selector: XPath selector or CSS selector (with css: prefix)
- wait_for_navigation (optional): If true, waits for navigation to complete after the click. Useful when the page changes URL.
{
"type": "click",
"selector": "css:button[type='submit']",
"wait_for_navigation": true
}
input
Fill a text input field.
- selector: XPath selector or CSS selector
- text: The text to enter
{
"type": "input",
"selector": "//input[@name='search']",
"text": "search query"
}
select
Select an option from a dropdown menu.
- selector: XPath selector or CSS selector
- value: The value to select
{
"type": "select",
"selector": "css:select#country",
"value": "AR"
}
key-press
Press a key.
- key: The key to press. Combinations are accepted. E.g., "Shift+O", "Enter", "Tab".
{
"type": "key-press",
"key": "Enter"
}
wait-for-selector
Wait for an element to appear on the page.
- selector: XPath selector or CSS selector
- time: Maximum wait time in milliseconds
{
"type": "wait-for-selector",
"selector": "css:.results-loaded",
"time": 5000
}
wait-for-timeout
Wait a fixed amount of time before continuing to the next action.
- time: Wait time in milliseconds
{
"type": "wait-for-timeout",
"time": 3000
}
collect
Extract data from the current page and accumulate it in memory. Especially useful inside while loops to collect data as you paginate or load more results.
- extract: Dictionary of selectors. Same format as the extract parameter of the main request.
{
"type": "collect",
"extract": {
"product_names": {
"selector": "css:.product-title",
"multiple": true
},
"prices": {
"selector": "css:.price",
"multiple": true
}
}
}
When using multiple: true, results accumulate between loop iterations (the list is extended). Without multiple, the value is overwritten on each iteration.
evaluate
Execute arbitrary JavaScript code in the browser page context. Results accumulate in the evaluate_results array of the response (one entry per evaluate action).
- script: JavaScript code to execute. Can be a simple expression or an async function.
- timeout (optional): Maximum wait time in milliseconds for JS execution (default: 30000).
Simple expression
{
"type": "evaluate",
"script": "document.title"
}
Async function (internal fetch)
Useful for triggering AJAX forms or calling internal endpoints with the page's session cookies:
{
"type": "evaluate",
"script": "(async () => { const r = await fetch('/api/data'); return await r.json(); })()",
"timeout": 15000
}
Full example — AJAX form with hidden fields
Typical sequence for a form that POSTs via JavaScript:
{
"url": "https://example.com/booking",
"browser": true,
"actions": [
{
"type": "wait-for-selector",
"selector": "css:#bookingForm",
"time": 5000
},
{
"type": "input",
"selector": "css:#destination",
"text": "Buenos Aires"
},
{
"type": "evaluate",
"script": "document.querySelector('#checkin').value = '2026-05-01'"
},
{
"type": "evaluate",
"script": "(async () => { const form = document.querySelector('#bookingForm'); const data = new URLSearchParams(new FormData(form)); const r = await fetch('/search', {method: 'POST', headers: {'Content-Type': 'application/x-www-form-urlencoded'}, body: data.toString()}); return r.status; })()"
},
{
"type": "wait-for-timeout",
"time": 2000
}
]
}
The response will include:
{
"evaluate_results": ["2026-05-01", 200],
...
}
Each evaluate execution adds an element to the evaluate_results array in the order they were executed.
If the script throws an error, the result will be an object {"error": "error message"} instead of the returned value.
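When mixing several evaluate actions, it can help to split return values from errors. A sketch; note that it treats any dict with an "error" key as a failure, which is a heuristic since a script could legitimately return such a dict:

```python
def split_evaluate_results(results):
    """Separate successful evaluate return values from per-script errors.

    Heuristic: a dict containing an "error" key is treated as a failed
    script, per the error format documented above.
    """
    values, errors = [], []
    for r in results or []:
        if isinstance(r, dict) and "error" in r:
            errors.append(r["error"])
        else:
            values.append(r)
    return values, errors
```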
Selectors
The selector is XPath by default, but can be changed to CSS by using the css: prefix before the selector.
"selector": "//div[@class='product']"
"selector": "css:div.product"
Loops
while
Control structure that repeats a sequence of actions while a condition is true, or until a maximum number of iterations is reached.
- condition: Condition to continue iterating.
- actions: List of actions to execute on each iteration (see Actions).
- max_iterations: Maximum number of allowed iterations.
Accepted conditions:
- selector-visible: Iterations continue while the selector is visible on the page.
- selector-invisible: Iterations continue while the selector is NOT visible on the page.
{
"type": "selector-visible",
"selector": "css:.load-more"
}
Full example — click "Load more" until the button disappears:
{
"url": "https://example.com/products",
"browser": true,
"actions": [
{
"type": "while",
"condition": {
"type": "selector-visible",
"selector": "css:.load-more-button"
},
"actions": [
{
"type": "click",
"selector": "css:.load-more-button"
},
{
"type": "wait-for-timeout",
"time": 2000
}
],
"max_iterations": 10
}
]
}
Example with collect — paginate and accumulate data:
{
"url": "https://example.com/products",
"browser": true,
"actions": [
{
"type": "while",
"condition": {
"type": "selector-visible",
"selector": "css:button.next-page"
},
"actions": [
{
"type": "collect",
"extract": {
"titles": {
"selector": "css:h3.product-name",
"multiple": true
}
}
},
{
"type": "click",
"selector": "css:button.next-page"
},
{
"type": "wait-for-timeout",
"time": 1500
}
],
"max_iterations": 20
}
]
}
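Request bodies like the one above can also be generated programmatically. A hypothetical builder; the function name, parameters, and defaults are illustrative, not part of the API:

```python
def paginate_request(url, item_selectors, next_selector,
                     max_pages=20, wait_ms=1500):
    """Build a /v1/sync/scrape body that clicks next_selector until it
    disappears, collecting item_selectors on every page."""
    return {
        "url": url,
        "browser": True,  # while loops and collect require browser mode
        "actions": [{
            "type": "while",
            "condition": {"type": "selector-visible", "selector": next_selector},
            "actions": [
                # collect with multiple: true accumulates across iterations
                {"type": "collect",
                 "extract": {name: {"selector": sel, "multiple": True}
                             for name, sel in item_selectors.items()}},
                {"type": "click", "selector": next_selector},
                {"type": "wait-for-timeout", "time": wait_ms},
            ],
            "max_iterations": max_pages,
        }],
    }


body = paginate_request("https://example.com/products",
                        {"titles": "css:h3.product-name"},
                        "css:button.next-page")
```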
Extract
extract is an optional field of the main request that allows extracting specific data from the page using CSS selectors. Only works with browser: true.
Supports two formats:
Simple format (string)
"extract": {
"title": "css:h1",
"description": "css:meta[name='description']"
}
Advanced format (object)
"extract": {
"all_prices": {
"selector": "css:.price",
"multiple": true
},
"main_image_src": {
"selector": "css:img.hero",
"attribute": "src"
},
"product_classes": {
"selector": "css:h3.product-title",
"multiple": true,
"attribute": "class"
}
}
- selector: CSS selector (with css: prefix)
- multiple (optional): If true, returns all elements matching the selector as an array. If false (default), returns only the first one.
- attribute (optional): Extracts the value of an element's attribute instead of its text. E.g., "href", "src", "class".
The extracted data is returned in the extracted_data field of the response.
http_method
Optional field, only available when browser is false (simple scraping).
- GET: Performs a standard GET request. Same behavior as omitting this field. Does not accept a payload.
- POST: Performs a POST request. The payload is optional.
"http_method": {
"method": "get"
}
"http_method": {
"method": "post",
"payload": {"category": "dogs", "page": 1}
}
Response
The successful endpoint response has the following format:
{
"html": "<html>...</html>",
"statusCode": 200,
"message": "OK",
"screenshot": null,
"executionTime": 1.23,
"extracted_data": null,
"potentiallyBlockedByCaptcha": false
}
potentiallyBlockedByCaptcha
Boolean field that indicates whether the received response appears to be a blocking or captcha page. Useful for easily detecting when a site is blocking scraper access without needing to manually analyze the HTML.
It is marked as true in the following cases:
- The server responds with a 403, 429, or 503 status code.
- The response HTML contains typical blocking signals, such as:
  - Phrases like "Are you a human?", "I'm not a robot", "Verify you are human"
  - Presence of captcha services: captcha, reCAPTCHA, hCAPTCHA
  - Cloudflare pages: "Just a moment...", "Checking your browser"
  - Unusual traffic or suspicious activity messages
Usage example:
import requests

response = requests.post(
    "https://api.scrapingpros.com/v1/sync/scrape",
    headers={"Authorization": "Bearer <API-KEY>"},
    json={"url": "https://example.com", "browser": False},
)
data = response.json()
if data["potentiallyBlockedByCaptcha"]:
    print("The site may be blocking access.")
This field is a heuristic, not a guaranteed detection. A false does not ensure the page is not blocked, and a true does not guarantee it is — it only indicates that common blocking signals were detected.
POST /v1/sync/download
This endpoint downloads a file directly from a URL (PDF, JPG image, PNG, etc.) and returns its content encoded in base64 along with the detected content type.
{
"url": "https://example.com/document.pdf",
"use_proxy": "any"
}
Response
{
"content": "JVBERi0xLjQK...",
"contentType": "application/pdf",
"statusCode": 200,
"message": "OK",
"executionTime": 0.312
}
- content: The file content encoded in base64.
- contentType: MIME type of the file returned by the server (e.g., application/pdf, image/png, image/jpeg).
- statusCode: HTTP status code of the file server response.
- message: Result message or error description.
- executionTime: Execution time in seconds.
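Decoding the response in Python is a one-step base64 round trip, sketched here with a fake payload instead of a live call:

```python
import base64


def save_download(response_json, path):
    """Decode the base64 `content` field and write the file to disk.
    Returns the number of bytes written."""
    raw = base64.b64decode(response_json["content"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)


# Fake /v1/sync/download response used for the round-trip demo
fake = {"content": base64.b64encode(b"%PDF-1.4 demo").decode(),
        "contentType": "application/pdf", "statusCode": 200}
```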
Parameters
url (required)
The direct URL to the file to download.
"url": "https://example.com/report.pdf"
use_proxy
Optional field. Same format as in /v1/sync/scrape. Useful when the file server restricts access by source IP.
"use_proxy": "any"
GET /v1/sync/metrics
This endpoint retrieves global metrics about the API's operation. It accepts two optional parameters and returns metrics in JSON format:
{
"date": "2026-02-11",
"scrape_type": {
"browser": {
"total": 21,
"success": 21,
"failed": 0,
"success_rate": 100,
"percentage_of_total": 52.5
},
"simple": {
"total": 19,
"success": 18,
"failed": 1,
"success_rate": 94.74,
"percentage_of_total": 47.5
},
"total_requests": 40
}
}
Parameters
- date: Date or date range. Accepted formats:
  - YYYY-MM-DD for a specific day
  - YYYY-MM-DD:YYYY-MM-DD for a date range
- metric: Allows requesting data about a specific type. Accepted values: url, proxy, api_codes, page_codes, exe_time, scrape_type
GET /v1/sync/client-metrics
Endpoint to retrieve per-client usage metrics. Each authenticated client sees only their own metrics. Administrators can see metrics for all clients.
Parameters
- date: Date or range. Accepted formats:
  - YYYY-MM-DD for a specific day (default: today)
  - YYYY-MM-DD:YYYY-MM-DD for a date range
  - YYYY-MM for a full month
- client (admin only): Filter by a specific client_id
- hourly: If true, includes an hourly breakdown (single day only)
- detail: If "urls", includes a per-domain breakdown with fields browser_success, browser_failed, simple_success, simple_failed
Example
curl 'https://api.scrapingpros.com/v1/sync/client-metrics?date=2026-03-25&hourly=true' \
-H 'Authorization: Bearer <API-KEY>'
Example with per-domain breakdown
curl 'https://api.scrapingpros.com/v1/sync/client-metrics?date=2026-03&detail=urls' \
-H 'Authorization: Bearer <API-KEY>'
GET /v1/sync/billing
Billing endpoint that returns a precise per-client usage summary, calculated from MySQL (not a Redis approximation). Ideal for generating monthly consumption reports.
Parameters
- month: Month in
YYYY-MMformat (default: current month) - client (admin only): Filter by a specific client_id
- detail: If
"urls", includes a per-domain breakdown (by_url)
Example
curl 'https://api.scrapingpros.com/v1/sync/billing?month=2026-03' \
-H 'Authorization: Bearer <API-KEY>'
Response
{
"month": "2026-03",
"clients": {
"my-client": {
"simple_success": 15000,
"simple_failed": 200,
"simple_total": 15200,
"browser_success": 8000,
"browser_failed": 150,
"browser_total": 8150,
"total_requests": 23350,
"total_success": 23000,
"total_failed": 350
}
}
}
Response with detail=urls
When detail=urls is passed, each client includes a by_url field with the per-domain breakdown:
{
"month": "2026-03",
"clients": {
"my-client": {
"simple_success": 15000,
"simple_failed": 200,
"simple_total": 15200,
"browser_success": 8000,
"browser_failed": 150,
"browser_total": 8150,
"total_requests": 23350,
"total_success": 23000,
"total_failed": 350,
"by_url": {
"example.com": {
"simple_success": 10000,
"simple_failed": 100,
"browser_success": 5000,
"browser_failed": 50,
"total": 15150
},
"other-site.com": {
"simple_success": 5000,
"simple_failed": 100,
"browser_success": 3000,
"browser_failed": 100,
"total": 8200
}
}
}
}
}
- simple_success / simple_failed: Successful/failed requests without browser.
- browser_success / browser_failed: Successful/failed requests with browser.
- simple_total / browser_total: Totals by type.
- total_requests: Total of all requests.
- total_success / total_failed: Success/failure totals.
- by_url: Per-domain breakdown (only if detail=urls).
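For example, a monthly report might derive per-client success rates from this response. An illustrative, client-side helper:

```python
def success_rates(billing_json):
    """Per-client success rate (%) from a /v1/sync/billing response."""
    rates = {}
    for client, counts in billing_json["clients"].items():
        total = counts["total_requests"]
        rates[client] = round(100 * counts["total_success"] / total, 2) if total else None
    return rates


# Trimmed version of the sample billing response above
sample = {"month": "2026-03",
          "clients": {"my-client": {"total_requests": 23350,
                                    "total_success": 23000,
                                    "total_failed": 350}}}
```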
GET /v1/health
Health check endpoint for monitoring and observability. Does not require authentication.
Checks the status of all critical API components: Redis, MySQL, proxies (internal and external), workers, and queues.
Example
curl 'https://api.scrapingpros.com/v1/health'
Response
{
"status": "healthy",
"checks": {
"redis": {
"status": "ok",
"latency_ms": 1.2
},
"mysql": {
"status": "ok",
"latency_ms": 3.5
},
"proxies_api": {
"internal": {
"status": "ok",
"latency_ms": 15.0
},
"external": {
"status": "ok",
"latency_ms": 120.0
}
},
"workers": {
"sync": {"up": 50, "expected": 50},
"async": {"up": 6, "expected": 6}
},
"queues": {
"pending_jobs": 0,
"async_scheduler": 0
}
},
"uptime_seconds": 86400
}
Possible statuses
| Status | Meaning |
|---|---|
| healthy | All components functioning correctly |
| degraded | Some non-critical component has issues (external proxies, workers below 90%, queue with >500 jobs) |
| unhealthy | A critical component is down (Redis, MySQL, or internal proxies) |
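The three statuses map naturally onto alerting levels. A minimal sketch for a monitoring hook; the mapping is a suggestion, not prescribed by the API:

```python
def should_page(health_json):
    """Map /v1/health status to an alerting action:
    page on-call for unhealthy, warn on degraded, otherwise ok."""
    status = health_json.get("status")
    if status == "unhealthy":
        return "page"
    if status == "degraded":
        return "warn"
    return "ok"
```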
Interactive Examples
The interactive playground covers the following common use cases:

| Example | Endpoint | Description |
|---|---|---|
| Simple scraping (without browser) | /v1/sync/scrape | Get the HTML of example.com without browser |
| Markdown output (for AI/LLM) | /v1/sync/scrape | Get clean text in markdown format |
| With browser | /v1/sync/scrape | Scrape with headless browser |
| Retry on block (anti-CAPTCHA) | /v1/sync/scrape | Auto-retry with a different IP if blocked |
| With browser + proxy + screenshot | /v1/sync/scrape | Browser with proxy and screenshot capture |
| With browser actions | /v1/sync/scrape | Click on a link and wait |
| Data extraction | /v1/sync/scrape | Extract title and links from a page |
| POST request with payload | /v1/sync/scrape | HTTP POST with JSON body (without browser) |
| Global metrics | /v1/sync/metrics?metric=scrape_type | View scrape_type metrics for today |
| Client metrics | /v1/sync/client-metrics | View metrics for the authenticated client |
| Download a PDF | /v1/sync/download | Download a PDF and get its content in base64 |
| Download an image | /v1/sync/download | Download a PNG image |
| Execute JavaScript on the page (evaluate) | /v1/sync/scrape | Execute JS and get the page title |
| Network request capture | /v1/sync/scrape | Capture XHR and fetch requests from a page |
| Current month billing | /v1/sync/billing | View billing summary for the current month |
| Billing with per-domain breakdown | /v1/sync/billing?detail=urls | View billing with per-domain breakdown |
| List available proxy countries | /v1/proxy/countries | See which countries have available proxies |
| Request access to a country proxy | /v1/proxy/request-country | Request approval to use proxies from a country |
| View proxy approval status | /v1/proxy/status | View approved and pending countries for your account |
| Health check | /v1/health | Check the status of all API components (no authentication required) |