Asynchronous Endpoints

The asynchronous endpoints allow you to group multiple requests into a collection and execute them in the background. Ideal for scraping large volumes of URLs.


POST /v1/async/collections

Creates a new request collection.

Request

curl -X POST \
'https://api.scrapingpros.com/v1/async/collections' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Idempotency-Key: 7f3a2b1e-4c5d-4f1a-8b2c-9d4e5f6a7b8c' \
-H 'Content-Type: application/json' \
-d '{
"name": "My collection",
"requests": [
{
"url": "https://example.com",
"custom_id": "tour_12345",
"browser": true
},
{
"url": "https://example.org",
"custom_id": "tour_12346",
"use_proxy": "any"
}
]
}'

Headers

| Header | Required | Description |
| --- | --- | --- |
| Authorization | Yes | Bearer <API-KEY> |
| Idempotency-Key | No | Client-generated unique key (UUID recommended). Lets you safely retry the request after a network timeout without creating a duplicate collection. See Idempotency below. |

Body

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | Yes | Name of the collection |
| requests | array | No | List of requests. Same format as the /v1/sync/scrape body; each request may include custom_id for traceability |

custom_id (optional, per request)

Client-supplied identifier (max 255 chars) that is echoed back in job listings, result payloads, and webhooks. Lets you correlate jobs to your own domain objects without matching by URL.

Deduplication by (url, custom_id): two requests with the same URL but different custom_id are considered distinct jobs (useful when the same URL feeds multiple pipelines — e.g. {url, "english"} and {url, "spanish"}). Two requests with the same URL and the same custom_id (or both without custom_id) are deduplicated — only the first one is kept and duplicates_skipped is incremented in the response.
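
The deduplication rule can be mirrored on the client to estimate duplicates_skipped before submitting. This is a minimal sketch, assuming requests are plain dicts and that a missing custom_id behaves like None. The helper name dedupe_requests is hypothetical and not part of the API, and the server may normalize URLs before comparing, which this sketch does not attempt.

```python
def dedupe_requests(requests_list):
    """Keep the first request for each (url, custom_id) pair.

    Mirrors the documented rule: same URL with a different custom_id is a
    distinct job; same URL with the same custom_id (or both missing) is a
    duplicate. Returns (kept_requests, duplicates_skipped).
    """
    seen = set()
    kept = []
    skipped = 0
    for req in requests_list:
        key = (req["url"], req.get("custom_id"))  # missing custom_id -> None
        if key in seen:
            skipped += 1
            continue
        seen.add(key)
        kept.append(req)
    return kept, skipped


batch = [
    {"url": "https://example.com", "custom_id": "english"},
    {"url": "https://example.com", "custom_id": "spanish"},  # distinct job
    {"url": "https://example.com", "custom_id": "english"},  # duplicate
    {"url": "https://example.org"},
    {"url": "https://example.org"},  # duplicate (both without custom_id)
]
kept, skipped = dedupe_requests(batch)
print(len(kept), skipped)
```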

Response (201)

{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "My collection",
"message": "Collection created successfully.",
"duplicates_skipped": 0,
"blocked_urls": []
}

Response when some URLs were rejected

If one or more URLs fail validation (private/internal IPs, unsupported protocols, malformed input), the collection is still created with the URLs that passed, and the rejected ones come back in blocked_urls. You can use that list to fix the inputs and re-submit only the failures, without having to parse the message.

{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "My collection",
"message": "Collection created successfully. 2 URL(s) blocked (SSRF protection).",
"duplicates_skipped": 0,
"blocked_urls": [
{
"index": 1,
"url": "http://192.168.1.1/admin",
"reason": "private_ip",
"message": "URL resolved to a private or internal IP."
},
{
"index": 2,
"url": "ftp://example.com/file",
"reason": "invalid_protocol",
"message": "Only http and https URLs are accepted."
}
]
}

reason is one of: private_ip, dns_failed, blocked_hostname, invalid_protocol, invalid_port, malformed_url, or blocked (generic fallback). index matches the position of the URL in your original requests array.
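
To re-submit only the failures, blocked_urls can be mapped back to the original requests through index. The sketch below is illustrative: requests_to_fix is a hypothetical helper, and index_base is an assumption, because the text says index is the position in the original array but does not state whether counting starts at 0 or 1.

```python
def requests_to_fix(original_requests, response, index_base=0):
    """Return (original_request, reason) pairs for every blocked URL.

    index_base is an assumption; adjust it if the API counts from 1.
    """
    out = []
    for blocked in response.get("blocked_urls", []):
        pos = blocked["index"] - index_base
        out.append((original_requests[pos], blocked["reason"]))
    return out


# Illustrative request list and a trimmed-down response payload.
original = [
    {"url": "https://example.com"},
    {"url": "http://192.168.1.1/admin"},
    {"url": "ftp://example.com/file"},
]
response = {
    "blocked_urls": [
        {"index": 1, "url": "http://192.168.1.1/admin", "reason": "private_ip"},
        {"index": 2, "url": "ftp://example.com/file", "reason": "invalid_protocol"},
    ]
}
for req, reason in requests_to_fix(original, response):
    print(req["url"], reason)
```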

Idempotency

Network timeouts during submit are common with large batches. Without protection, a client retry can create a second collection and double the cost. Send an Idempotency-Key header — any UUID per logical operation — to make the retry safe:

| Scenario | Result |
| --- | --- |
| First request with the key | Collection is created normally. Response header Idempotency-Replayed: false. |
| Retry with same key + same body within 24 h | Returns the original id and response, without re-processing or re-charging. Header Idempotency-Replayed: true. |
| Same key but different body | 422 "Idempotency-Key reused with a different payload." Use a new key for a different operation. |
| Two requests with the same key arriving in parallel | The second one waits (up to 30 s) and replays the first one's response. |

The key is stored for 24 h, scoped to your client. It must be ≤ 200 chars and contain no whitespace or :. UUIDs (e.g. generated with uuid.uuid4()) are recommended.

If you don't send the header, behavior is unchanged — each POST creates a new collection.
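
The key constraints above can be checked on the client before sending. This is a sketch under stated assumptions: validate_idempotency_key and build_headers are hypothetical helpers, and rejecting an empty key is an extra assumption not stated in the text. The important part is that the key is generated once per logical operation and reused on every retry.

```python
import uuid


def validate_idempotency_key(key):
    """Check the documented constraints: at most 200 chars, no whitespace, no ':'.

    Rejecting an empty key is an assumption of this sketch.
    """
    if not key or len(key) > 200:
        return False
    return not any(ch.isspace() or ch == ":" for ch in key)


def build_headers(api_key, idempotency_key=None):
    """Build request headers; generate a UUID4 key when none is supplied."""
    key = idempotency_key or str(uuid.uuid4())
    if not validate_idempotency_key(key):
        raise ValueError("invalid Idempotency-Key")
    return {
        "Authorization": f"Bearer {api_key}",
        "Idempotency-Key": key,
        "Content-Type": "application/json",
    }


# First attempt: a fresh key is generated.
headers = build_headers("<API-KEY>")
# Retry after a timeout: reuse the same key so the server can replay the response.
retry_headers = build_headers("<API-KEY>", headers["Idempotency-Key"])
print(headers["Idempotency-Key"] == retry_headers["Idempotency-Key"])
```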


GET /v1/async/collections

Lists collections, optionally filtered by name and creation time.

Request

# All collections
curl 'https://api.scrapingpros.com/v1/async/collections' \
-H 'Authorization: Bearer <API-KEY>'

# Exact-name match (e.g. recover a collection after a timeout)
curl 'https://api.scrapingpros.com/v1/async/collections?name=daily-2026-04-30' \
-H 'Authorization: Bearer <API-KEY>'

# All collections that start with `daily-`
curl 'https://api.scrapingpros.com/v1/async/collections?name_prefix=daily-' \
-H 'Authorization: Bearer <API-KEY>'

# All collections created in the last hour
curl 'https://api.scrapingpros.com/v1/async/collections?since=2026-04-30T11:00:00Z' \
-H 'Authorization: Bearer <API-KEY>'

Query parameters

| Param | Type | Description |
| --- | --- | --- |
| name | string | Exact match on the collection name. |
| name_prefix | string | Returns collections whose name starts with this prefix. |
| since | ISO 8601 | Returns collections created at or after this timestamp. Collections created before this field was tracked (legacy) are excluded when since is set. |

All three can be combined; they apply with AND semantics.
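
Combining filters is just a matter of sending more than one query parameter. A small sketch using only the standard library; collections_url is a hypothetical helper, and only the three documented parameters are used.

```python
from urllib.parse import urlencode

BASE = "https://api.scrapingpros.com"


def collections_url(name=None, name_prefix=None, since=None):
    """Build the listing URL, including only the filters that were provided."""
    params = {"name": name, "name_prefix": name_prefix, "since": since}
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE}/v1/async/collections"
    return f"{url}?{query}" if query else url


# Collections whose name starts with "daily-" AND created since the timestamp.
print(collections_url(name_prefix="daily-", since="2026-04-30T11:00:00Z"))
```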

Response (200)

[
{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "My collection",
"created_at": 1777853200.5,
"updated_at": 1777853200.5
},
{
"id": "11d6f8af-9a54-4b6c-b793-e12b77c86159",
"name": "Another collection",
"created_at": 1777851000.1,
"updated_at": 1777851500.3
}
]

created_at and updated_at are epoch seconds (UTC). They are null for collections created before this field was tracked — clients should tolerate the null.
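
A null-tolerant conversion of these timestamps might look like the following sketch. epoch_to_datetime is a hypothetical helper, and the "Legacy collection" row is made-up sample data used to show the null case.

```python
from datetime import datetime, timezone


def epoch_to_datetime(value):
    """Convert epoch seconds (UTC) to an aware datetime; pass None through."""
    if value is None:
        return None
    return datetime.fromtimestamp(value, tz=timezone.utc)


collections = [
    {"name": "My collection", "created_at": 1777853200.5},
    {"name": "Legacy collection", "created_at": None},  # created before tracking
]
for c in collections:
    print(c["name"], epoch_to_datetime(c["created_at"]))
```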


GET /v1/async/collections/{collection_id}

Gets a specific collection by its ID.

Request

curl 'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-cb7c-4728-8704-2c2e267dcff9' \
-H 'Authorization: Bearer <API-KEY>'

Response (200)

{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "My collection",
"created_at": 1777853200.5,
"updated_at": 1777853200.5
}

PUT /v1/async/collections/{collection_id}

Updates an existing collection. Both the name and the request list can be modified. If a new request list is sent, it replaces the previous one.

Request

curl -X PUT \
'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-cb7c-4728-8704-2c2e267dcff9' \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Content-Type: application/json' \
-d '{
"name": "Updated collection",
"requests": [
{
"url": "https://new-example.com",
"browser": true
}
]
}'

Response (200)

{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "Updated collection",
"message": "Collection updated successfully."
}

POST /v1/async/collections/{collection_id}/run

Executes all requests in a collection asynchronously. A collection can be executed multiple times.

Request

curl -X POST \
'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-cb7c-4728-8704-2c2e267dcff9/run' \
-H 'Authorization: Bearer <API-KEY>'

No body required.

Response (201)

{
"run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
"status": "in_progress",
"total_requests": 2,
"success_requests": 0,
"failed_requests": 0,
"timeout_requests": 0,
"collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9"
}

GET /v1/async/collections/{collection_id}/runs

Lists every run that has been executed against a given collection, newest first. Useful when you persist a collection_id and need to enumerate its history (current run, previous re-runs, audit trail), or when you want to reattach to a live run after the original POST /run request lost its response (network timeout, etc.).

Request

# All runs of this collection
curl 'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-.../runs' \
-H 'Authorization: Bearer <API-KEY>'

# Just the live run (helpful after a submit timeout — you keep `collection_id`,
# you fetch the run that's already executing)
curl 'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-.../runs?status_filter=in_progress' \
-H 'Authorization: Bearer <API-KEY>'

Query parameters

| Param | Type | Description |
| --- | --- | --- |
| status_filter | in_progress or completed | Filter by run status. Omit for all runs. |

Response (200)

{
"items": [
{
"run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
"status": "in_progress",
"total_requests": 100,
"success_requests": 73,
"failed_requests": 5,
"timeout_requests": 0,
"collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"callback_url": null,
"callback_status": null,
"created_at": 1777853217.82
},
{
"run_id": "8c9bafe2-...",
"status": "completed",
"total_requests": 100,
"success_requests": 99,
"failed_requests": 1,
"timeout_requests": 0,
"collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"callback_url": "https://example.com/webhook",
"callback_status": "sent",
"created_at": 1777840000.10
}
],
"total": 2
}

Order: newest first (created_at desc). Runs created before this field was tracked sort to the bottom with created_at: null.
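
When reattaching after a lost POST /run response, the live run can be picked out of this payload on the client. A minimal sketch, assuming the response shape shown above; find_live_run is a hypothetical helper and the sample payload is trimmed to the fields it reads.

```python
def find_live_run(runs_response):
    """Return the newest in_progress run from a /runs response, or None."""
    for run in runs_response.get("items", []):  # items arrive newest first
        if run["status"] == "in_progress":
            return run
    return None


sample = {
    "items": [
        {"run_id": "9b64941a-4545-4c57-9174-c70e781d9192", "status": "in_progress"},
        {"run_id": "8c9bafe2-...", "status": "completed"},
    ],
    "total": 2,
}
live = find_live_run(sample)
print(live["run_id"] if live else "no live run")
```

Passing status_filter=in_progress on the request achieves the same result server-side; the client-side version is useful when the full history is already in hand.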


GET /v1/async/collections/{collection_id}/runs/{run_id}

Queries the status and result of an execution. Call periodically until status is completed.

Request

curl 'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-cb7c-4728-8704-2c2e267dcff9/runs/9b64941a-4545-4c57-9174-c70e781d9192' \
-H 'Authorization: Bearer <API-KEY>'

Response -- in progress (200)

{
"run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
"status": "in_progress",
"total_requests": 2,
"success_requests": 1,
"failed_requests": 0,
"timeout_requests": 0,
"collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9"
}

Response -- completed without errors (200)

{
"run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
"status": "completed",
"total_requests": 2,
"success_requests": 2,
"failed_requests": 0,
"timeout_requests": 0,
"collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"failed_jobs": []
}

Response -- completed with errors (200)

{
"run_id": "9b64941a-4545-4c57-9174-c70e781d9192",
"status": "completed",
"total_requests": 3,
"success_requests": 2,
"failed_requests": 1,
"timeout_requests": 0,
"collection_id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"failed_jobs": [
{
"job_id": "e3a1b2c4-...",
"url": "https://example.com/page-that-failed",
"custom_id": "tour_12346",
"status": "failed",
"error": "Connection timeout"
}
]
}

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| run_id | string (UUID) | Unique identifier of the execution |
| status | string | Status: in_progress or completed |
| total_requests | integer | Total requests in the collection |
| success_requests | integer | Requests that delivered usable content: worker completed, target responded with HTTP 2xx, and no block signal (captcha, softblock, etc.) was detected. |
| failed_requests | integer | Requests that did not deliver usable content. Includes: worker failures, 4xx/5xx from the target, captcha pages, and any response flagged by the HTML validator. |
| timeout_requests | integer | Requests that timed out at the worker level (never got a response from the target). |
| success_criterion | object or null | Declarative description of the active classification rule. Fields: version (e.g. content_success_v1) and rules (human-readable predicates). Lets clients pin the rule version in their SDK and detect silent policy changes. |
| collection_id | string (UUID) | ID of the executed collection |
| failed_jobs | array | List of failed or timed-out jobs, with their URL and error reason |

Counters measure content success

success_requests is what the client gets back, not worker-level completion. A job that finished but whose target returned a 500 or a captcha page counts as failed. This aligns the counter with what you're paying for: HTML you can use.

For worker-level health (how many jobs finished without infrastructure failure, regardless of content), use the per-job listing with status_filter=completed and count the items yourself.

Where are the scraping results?

The run status endpoint returns a summary (total / success / failed counters) but does not include the scraped content inline. To retrieve per-job data use the two endpoints below:

  • List all jobs of a run (cursor-paginated, with URL and timings): GET /v1/async/collections/{collection_id}/runs/{run_id}/jobs
  • Full result of a specific job (HTML/JSON body): GET /v1/async/collections/{collection_id}/runs/{run_id}/jobs/{job_id}/result

Result bodies are retained for 48 hours after job completion. Metadata in the listing endpoint is retained for 90 days. The list of job_ids on the run itself is available for the lifetime of the run (it falls back to the durable record if the cache misses), so you can always enumerate the jobs of a run regardless of how long ago it completed.


GET /v1/async/collections/{collection_id}/runs/{run_id}/jobs

Lists all jobs belonging to a run, with cursor-based pagination. Returns URL, status, timings, custom_id, and validator metadata for every job — without the (potentially large) HTML body.

Typical use: iterate over completed jobs to pick which ones to download full result for, or build a custom dashboard.

Request

curl 'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-.../runs/9b64941a-.../jobs?limit=100' \
-H 'Authorization: Bearer <API-KEY>'

Query parameters

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| cursor | string | (none) | Opaque cursor returned by the previous page. Omit on the first call. Encoding depends on order_by (see below); passing a cursor generated with a different order_by returns 400. |
| limit | integer | 100 | Page size. Min 1, max 1000. |
| status_filter | string / CSV | (none) | Filter by status. Accepts a single value or a CSV for multiple: completed, failed, timeout, processing. Example: status_filter=completed,failed,timeout. Omit to return all. |
| since_completed_at | ISO 8601 string | (none) | When set, returns only rows with completed_at strictly greater. Accepts Z, +00:00, or naive (treated as UTC). Rows with NULL completed_at (e.g. still processing) are excluded. Useful for incremental polling to avoid re-fetching already-seen completions. |
| order_by | id or completed_at | id | Sort order. id preserves legacy behaviour (insertion order). completed_at sorts by when the job finished, which is ideal for streaming completions as they happen. |
| order_dir | asc or desc | asc | Direction (only honored for order_by=completed_at; order_by=id is always ASC). |

Cursor encoding

  • order_by=id: base64(str(auto_increment_id)) — back-compat. Stable across deploys.
  • order_by=completed_at: base64("<iso_ts>|<job_public_id>"). The tuple (completed_at, job_public_id) is a stable position in the ordering; the job_public_id breaks ties when two jobs complete at the same millisecond.

Cursors have no TTL. A cursor stays valid for as long as the partition containing newer rows exists (90-day retention). Once the run's partition is dropped, the cursor returns an empty page without error.
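
For debugging, a cursor can be decoded following the two encodings listed above. Treat this as an inspection aid only: clients should keep passing cursors back unchanged. The sketch assumes standard (not URL-safe) base64 and UTF-8 text, neither of which is stated in the text; decode_cursor is a hypothetical helper and the round-trip values are illustrative.

```python
import base64


def decode_cursor(cursor, order_by="id"):
    """Decode a cursor for inspection. Treat cursors as opaque in production."""
    raw = base64.b64decode(cursor).decode("utf-8")
    if order_by == "id":
        return {"id": int(raw)}
    # order_by=completed_at: "<iso_ts>|<job_public_id>"
    iso_ts, job_public_id = raw.split("|", 1)
    return {"completed_at": iso_ts, "job_public_id": job_public_id}


# The cursor_next value from the sample response in this section.
print(decode_cursor("MzQ="))

# Round-trip a completed_at cursor built from illustrative values.
sample = base64.b64encode(b"2026-04-23T12:00:03.637|e3a1b2c4").decode("ascii")
print(decode_cursor(sample, order_by="completed_at"))
```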

Response (200)

{
"items": [
{
"job_public_id": "e3a1b2c4-...",
"run_public_id": "9b64941a-...",
"collection_id": "c38b0bcf-...",
"status": "completed",
"url": "https://example.com/tours/123",
"custom_id": "tour_12345",
"url_truncated": false,
"status_code": 200,
"message": null,
"queued_at": "2026-04-23T12:00:00.123",
"started_at": "2026-04-23T12:00:02.267",
"completed_at": "2026-04-23T12:00:03.637",
"execution_time_ms": 1370,
"retries_attempted": 0,
"block_reason": null,
"protection_stack": ["cloudflare"],
"rule_hits": []
}
],
"cursor_next": "MzQ=",
"has_more": true
}

Response fields

| Field | Type | Description |
| --- | --- | --- |
| job_public_id | UUID | Job identifier |
| url | string | Target URL of the scrape. Truncated to 2048 chars; see url_truncated |
| url_truncated | bool | true if the original URL exceeded the storage column size and was cut. Compare URLs with care if this is true |
| custom_id | string or null | Client-supplied identifier, echoed from the collection request |
| status | string | completed / failed / timeout / processing |
| status_code | integer or null | HTTP status returned by the target site |
| queued_at / started_at / completed_at | ISO 8601 | Lifecycle timestamps (UTC, millisecond precision) |
| execution_time_ms | integer | End-to-end duration in ms |
| retries_attempted | integer | 0 when the first attempt succeeded |
| block_reason | string or null | Populated by the HTML validator when content is flagged (captcha, softblock, shell, hard_block, etc.) |
| protection_stack | array of strings | Anti-bot providers detected on the target (e.g. ["cloudflare", "datadome"]) |
| rule_hits | array of strings | Validator rules that fired (debug/diagnostic; may be empty in success cases) |
| is_success | bool or null | Pre-computed content-success verdict. true = the client received usable HTML (2xx + no captcha/block). false = worker completed but content is not usable (4xx/5xx/captcha) OR worker failed/timed out. null = verdict not computed (pre-migration rows). Sum of rows with is_success=true equals run.success_requests by construction, so you can classify jobs on the client without replicating the rule. |
| cursor_next | string or null | Cursor to request the next page. null when there are no more items |
| has_more | bool | true when additional pages exist |
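
Because is_success is pre-computed, classifying a page of jobs needs no knowledge of the rule itself. A minimal sketch, assuming job items are dicts as in the response above; tally_verdicts and its output keys are hypothetical names, and the sample jobs are made up.

```python
from collections import Counter


def tally_verdicts(jobs):
    """Count jobs by is_success: True, False, or None (verdict not computed)."""
    counts = Counter(job.get("is_success") for job in jobs)
    return {
        "success": counts[True],
        "not_usable": counts[False],
        "unknown": counts[None],
    }


jobs = [
    {"job_public_id": "a", "is_success": True},
    {"job_public_id": "b", "is_success": False},  # e.g. captcha or 5xx
    {"job_public_id": "c", "is_success": None},   # pre-migration row
    {"job_public_id": "d", "is_success": True},
]
print(tally_verdicts(jobs))
```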

Paginating the full run

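import requests

BASE = "https://api.scrapingpros.com"
H = {"Authorization": "Bearer <API-KEY>"}
cid = "<COLLECTION-ID>"  # placeholder: your collection ID
rid = "<RUN-ID>"  # placeholder: the run_id returned by POST /run
url = f"{BASE}/v1/async/collections/{cid}/runs/{rid}/jobs"


def process(job):
    # Placeholder: store in your DB, queue HTML download, etc.
    print(job["job_public_id"], job["status"])


cursor = None
while True:
    params = {"limit": 500}
    if cursor:
        params["cursor"] = cursor
    resp = requests.get(url, headers=H, params=params)
    resp.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    page = resp.json()
    for job in page["items"]:
        process(job)
    if not page["has_more"]:
        break
    cursor = page["cursor_next"]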

There is a ~5-second lag between a job completing and appearing in this listing (the metadata flusher runs on a 5s tick). For strict real-time notification, use webhooks (coming in a future release).

Efficient incremental polling

When iterating a large batch (thousands of URLs), combine order_by=completed_at, since_completed_at, and status_filter to fetch only new completions since the last poll:

import requests

BASE = "https://api.scrapingpros.com"
H = {"Authorization": "Bearer <API-KEY>"}
cid = "<COLLECTION-ID>"  # placeholder: your collection ID
rid = "<RUN-ID>"  # placeholder: the run_id returned by POST /run


def handle(job):
    # Placeholder: replace with your own processing.
    print(job["job_public_id"], job["status"])


# Track the newest completed_at we have already consumed.
last_seen = None

while True:
    params = {
        "order_by": "completed_at",
        "status_filter": "completed,failed,timeout",
        "limit": 1000,
    }
    if last_seen:
        params["since_completed_at"] = last_seen

    r = requests.get(
        f"{BASE}/v1/async/collections/{cid}/runs/{rid}/jobs",
        headers=H, params=params,
    )
    r.raise_for_status()
    page = r.json()
    for job in page["items"]:
        handle(job)
        # ISO 8601 strings in the same format compare chronologically.
        if job["completed_at"] and (not last_seen or job["completed_at"] > last_seen):
            last_seen = job["completed_at"]
    if not page.get("has_more"):
        break  # caught up; poll again after a short sleep

For a batch of 50 000 URLs this reduces polling cost from ~50 API calls per tick (full pagination + client-side dedup) to ~1 call per tick.
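
The ~50 figure follows from the maximum page size of 1000. The arithmetic, as a sketch with a hypothetical helper name:

```python
import math


def calls_needed(rows, page_size=1000):
    """Number of listing calls needed to fetch `rows` jobs at a given page size."""
    return math.ceil(rows / page_size)


# Full pagination of the whole run on every tick.
print(calls_needed(50_000))
# Incremental polling: only rows completed since the last poll are fetched,
# so a tick with, say, 300 new completions fits in a single call.
print(calls_needed(300))
```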


GET /v1/async/collections/{collection_id}/runs/{run_id}/jobs/{job_id}/result

Returns the full scraping result of a single job — same shape as the response of POST /v1/sync/scrape, plus url and custom_id for traceability.

Request

curl 'https://api.scrapingpros.com/v1/async/collections/c38b0bcf-.../runs/9b64941a-.../jobs/e3a1b2c4-.../result' \
-H 'Authorization: Bearer <API-KEY>'

Response (200)

{
"url": "https://example.com/tours/123",
"custom_id": "tour_12345",
"status": "completed",
"html": "<!doctype html> ...",
"statusCode": 200,
"extracted_data": null,
"timings": { "total_ms": 1370, "navigation_ms": 890 },
"executionTime": 1.37,
"potentiallyBlockedByCaptcha": false,
"guidance": {
"success": true,
"error_type": null,
"next_steps": [],
"suggested_request": null,
"stop_reason": null
}
}

The response shape matches POST /v1/sync/scrape. In particular, guidance is now populated in the async /result endpoint, so clients have the same post-hoc diagnostics they already get in sync mode (why a request failed, which parameters to adjust, whether to retry).

Retention

HTML bodies are retained for 48 hours after job completion. After that window, the listing metadata remains for 90 days but this endpoint returns 404. If you need longer retention, download the result once the job completes (or subscribe to the run callback_url) and persist on your side.
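
To decide which bodies to download first, the end of the retention window can be estimated from completed_at. This is a sketch, assuming the 48-hour window stated above and the timestamp formats shown in the job listing (naive values treated as UTC); parse_utc, body_expires_at, and body_still_available are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(hours=48)  # body retention window stated above


def parse_utc(ts):
    """Parse an ISO 8601 timestamp; naive values and 'Z' are treated as UTC."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def body_expires_at(completed_at):
    """Estimated moment after which the result body is no longer retained."""
    return parse_utc(completed_at) + RETENTION


def body_still_available(completed_at, now=None):
    """True while the estimated retention window is still open."""
    now = now or datetime.now(timezone.utc)
    return now < body_expires_at(completed_at)


print(body_expires_at("2026-04-23T12:00:03.637"))
```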

When the result is not available (404)

If the job completes successfully you'll receive 200 with the result body. If the body is unavailable, the response is 404 with a structured detail that tells you which kind of unavailable it is, so you can react accordingly:

HTTP 404
{
"detail": {
"error_code": "result_lost",
"message": "Job result is unavailable due to a service incident during the completion window. Contact support if the data is critical — it may qualify for refund.",
"completed_at": "2026-04-30T12:34:56Z",
"age_hours": 0.4
}
}

| error_code | What it means | Suggested action |
| --- | --- | --- |
| result_pending | The job is still in flight (or the worker did not store a result yet). | Retry shortly; typical jobs complete in seconds. |
| result_expired | The job completed more than 24 h ago. The metadata is still in the listing endpoint, but the body has been pruned. | Re-run the collection if you still need the data. |
| result_lost | The job completed within the last 24 h, but the body is unavailable. | Contact support; it may qualify for a refund. |
| job_id_invalid | We have no record of this job_id for the given run_id. | Verify the IDs you're using; this typically points to a client bug. |

The detail field has been a string in older versions of the API and may keep being a string in some edge paths (e.g. when the upstream lookup itself times out). Robust clients should accept both shapes.
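
A client that accepts both shapes might normalize the 404 body as in the sketch below. parse_404_detail, suggested_action, and the action labels are hypothetical names chosen for illustration; the error codes come from the table above, and the sample messages are made up.

```python
# Hypothetical action labels keyed by the documented error codes.
ACTIONS = {
    "result_pending": "retry",
    "result_expired": "rerun",
    "result_lost": "contact_support",
    "job_id_invalid": "check_ids",
}


def parse_404_detail(body):
    """Normalize a 404 body to (error_code, message), whatever detail's shape."""
    detail = body.get("detail")
    if isinstance(detail, dict):
        return detail.get("error_code"), detail.get("message", "")
    # Legacy or edge-path shape: detail is a plain string with no error code.
    return None, ("" if detail is None else str(detail))


def suggested_action(error_code):
    """Map an error code to an action label; unknown codes need a human look."""
    return ACTIONS.get(error_code, "inspect_message")


structured = {"detail": {"error_code": "result_expired", "message": "Body pruned."}}
legacy = {"detail": "Result not found"}
for body in (structured, legacy):
    code, message = parse_404_detail(body)
    print(code, suggested_action(code), message)
```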

Example: polling until completion and downloading results

import time

import requests

BASE = "https://api.scrapingpros.com"
H = {"Authorization": "Bearer <API-KEY>"}
COLLECTION_ID = "c38b0bcf-cb7c-4728-8704-2c2e267dcff9"

# 1. Start the run
run = requests.post(
    f"{BASE}/v1/async/collections/{COLLECTION_ID}/run",
    headers=H,
).json()
run_id = run["run_id"]

# 2. Poll the run status until completed
while True:
    status = requests.get(
        f"{BASE}/v1/async/collections/{COLLECTION_ID}/runs/{run_id}",
        headers=H,
    ).json()
    print(f"Status: {status['status']} - "
          f"{status['success_requests']}/{status['total_requests']} successful")
    if status["status"] == "completed":
        break
    time.sleep(5)

# 3. Iterate jobs via cursor, download HTML, match by custom_id
cursor, results = None, {}
while True:
    params = {"limit": 500, "status_filter": "completed"}
    if cursor:
        params["cursor"] = cursor
    page = requests.get(
        f"{BASE}/v1/async/collections/{COLLECTION_ID}/runs/{run_id}/jobs",
        headers=H, params=params,
    ).json()
    for job in page["items"]:
        # Download the full body for each completed job
        r = requests.get(
            f"{BASE}/v1/async/collections/{COLLECTION_ID}/runs/{run_id}"
            f"/jobs/{job['job_public_id']}/result",
            headers=H,
        )
        if r.status_code != 200:
            continue  # body unavailable (e.g. 404); see the error codes above
        results[job["custom_id"] or job["url"]] = r.json()["html"]
    if not page["has_more"]:
        break
    cursor = page["cursor_next"]

# 4. (optional) Check failed jobs
if status.get("failed_jobs"):
    print("Failed jobs:")
    for job in status["failed_jobs"]:
        print(f"  - custom_id={job.get('custom_id')} url={job['url']}: {job['error']}")

Python SDK

The SDK (pip install "scrapingpros>=0.3.0", quoted so the shell does not treat >= as a redirect) offers client.iter_run_jobs(collection_id, run_id) as a generator that handles cursor pagination internally, plus typed models with datetime parsing for the timestamp fields.