Collection

A collection is a group of requests that will be run asynchronously.

Endpoints

POST /v1/async/collections

Creates a new collection.

Once the collection is created, it can be run using the /collections/{collection_id}/run endpoint.

Returns a JSON object containing the ID of the new collection, the collection name, and a success message.

This endpoint requires a body with the following structure:

{
"name": "New collection",
"requests": [
{
"url": "https://www.example.com",
"custom_id": "tour_12345",
"browser": true,
"screenshot": false,
"actions": [
{
"type": "wait-for-timeout",
"time": 5000
}
]
}
],
"callback_url": "https://your-server.com/webhook"
}
  • name: Name of the project or collection.
  • requests: A list of requests. The parameters for this field are detailed in scrape. Each request may include an optional custom_id for traceability (see below).
  • callback_url (optional): URL to which a POST notification will be sent when a run of this collection completes. The webhook includes event, run_id, status, counters (total_requests, success_requests, failed_requests), and job_ids. Delivery is signed with HMAC-SHA256 (headers X-SP-Signature and X-SP-Timestamp). Can be overridden per run.
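The signed webhook delivery can be checked on the receiving side. The following is a minimal sketch, not the official verification routine: this page names the headers (X-SP-Signature, X-SP-Timestamp) and the algorithm (HMAC-SHA256), but not the exact signed payload or the encoding. The sketch assumes the signature is the hex digest of "<timestamp>.<raw body>" computed with a shared secret; the function name verify_webhook and that payload layout are assumptions to confirm before relying on it.

```python
import hashlib
import hmac


def verify_webhook(secret: bytes, timestamp: str, body: bytes, signature: str) -> bool:
    """Return True if `signature` matches the HMAC-SHA256 of the signed payload.

    ASSUMPTION: the signed payload is "<timestamp>.<raw body>" and the
    signature is hex-encoded. Neither detail is stated in this reference.
    """
    signed_payload = timestamp.encode("utf-8") + b"." + body
    expected = hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so the comparison does not
    # reveal how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Pass the raw request body bytes as received, the value of X-SP-Timestamp as timestamp, and the value of X-SP-Signature as signature. Re-serializing the parsed JSON before hashing would likely change the bytes and break the comparison.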

custom_id per request

Optional client-supplied identifier (max 255 characters) that is echoed back in job listings, per-job results, and the run completion webhook. It lets you correlate jobs to your own domain objects — for example, a tour ID, product SKU, or row in your database — without matching by URL.

Deduplication by (url, custom_id): two requests with the same URL but different custom_id are kept as distinct jobs (useful when the same URL drives multiple pipelines — e.g. different parsers, different languages). Two requests with the same URL and the same custom_id (or both without custom_id) are deduplicated — the response reports how many were skipped via duplicates_skipped.

{
"name": "Daily tours",
"requests": [
{ "url": "https://viator.com/tours/1", "custom_id": "en_pipeline", "browser": true },
{ "url": "https://viator.com/tours/1", "custom_id": "es_pipeline", "browser": true },
{ "url": "https://viator.com/tours/2", "custom_id": "en_pipeline", "browser": true }
]
}

Response:

{
"id": "c38b0bcf-...",
"name": "Daily tours",
"message": "Collection created successfully.",
"duplicates_skipped": 0,
"blocked_urls": []
}
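To estimate duplicates_skipped before submitting, the (url, custom_id) rule can be approximated locally. This is a sketch under assumptions: it compares URLs as exact strings, whereas the server may normalize them first, and the helper name dedupe_requests is hypothetical.

```python
from typing import Any, Optional


def dedupe_requests(requests: list[dict[str, Any]]) -> tuple[list[dict[str, Any]], int]:
    """Keep the first request for each (url, custom_id) pair and count the rest.

    A missing custom_id is treated as None, so two requests with the same URL
    and no custom_id collapse into one, as described above.
    """
    seen: set[tuple[str, Optional[str]]] = set()
    kept: list[dict[str, Any]] = []
    skipped = 0
    for request in requests:
        key = (request["url"], request.get("custom_id"))
        if key in seen:
            skipped += 1
            continue
        seen.add(key)
        kept.append(request)
    return kept, skipped
```

Applied to the three requests in the "Daily tours" example, every (url, custom_id) pair is distinct, so nothing is skipped, which matches the 0 in the response.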

Safe retries with Idempotency-Key

For large batches, it is common to lose the response to a network timeout. Without protection, your retry would create a second collection (and incur a second batch of charges). Send a unique Idempotency-Key header — typically a UUID generated per submit — to make the request safe to retry:

curl -X POST https://api.scrapingpros.com/v1/async/collections \
-H 'Authorization: Bearer <API-KEY>' \
-H 'Idempotency-Key: 7f3a2b1e-4c5d-4f1a-8b2c-9d4e5f6a7b8c' \
-H 'Content-Type: application/json' \
-d '{ "name": "...", "requests": [ ... ] }'

If the same key is sent again with the same body within 24 h, the API returns the original collection (header Idempotency-Replayed: true) without re-processing or re-charging. If the body is different, the API returns 422 so you know you reused a key by mistake. Without the header, behaviour is unchanged.
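The important detail when retrying is that the key is generated once per submit and reused on every attempt. The sketch below illustrates this with an injected transport: send is a hypothetical callback (not part of the API) that performs the POST, returns the HTTP status code, and raises TimeoutError when the response is lost. The function name submit_with_retry and the retry count are likewise assumptions.

```python
import uuid
from typing import Callable


def submit_with_retry(
    send: Callable[[dict[str, str], bytes], int],
    body: bytes,
    api_key: str,
    max_attempts: int = 3,
) -> int:
    """Call send(headers, body) until it returns, reusing one Idempotency-Key."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        # Generated once per submit, NOT once per attempt.
        "Idempotency-Key": str(uuid.uuid4()),
    }
    for attempt in range(1, max_attempts + 1):
        try:
            return send(headers, body)
        except TimeoutError:
            # Give up only after the final attempt; otherwise retry
            # with exactly the same headers and body.
            if attempt == max_attempts:
                raise
    raise ValueError("max_attempts must be at least 1")
```

The body must also be byte-for-byte identical across attempts, since a different body with the same key is answered with 422.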

When some URLs are rejected

If one or more URLs fail validation (private/internal IPs, unsupported protocols, malformed input), the collection is still created with the URLs that passed, and the rejected ones come back in the response so you can fix and re-submit only those:

{
"id": "c38b0bcf-...",
"name": "Daily tours",
"message": "Collection created successfully. 2 URL(s) blocked (SSRF protection).",
"duplicates_skipped": 0,
"blocked_urls": [
{
"index": 1,
"url": "http://192.168.1.1/admin",
"reason": "private_ip",
"message": "URL resolved to a private or internal IP."
},
{
"index": 4,
"url": "ftp://example.com/file",
"reason": "invalid_protocol",
"message": "Only http and https URLs are accepted."
}
]
}

reason is one of: private_ip, dns_failed, blocked_hostname, invalid_protocol, invalid_port, malformed_url, or blocked.
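When deciding what to fix and re-submit, it can help to group the rejected URLs by reason. A minimal sketch, assuming the response has already been parsed into a dict; the helper name group_blocked_urls is hypothetical.

```python
from collections import defaultdict
from typing import Any


def group_blocked_urls(response: dict[str, Any]) -> dict[str, list[str]]:
    """Map each reason to the list of rejected URLs reported for it."""
    grouped: dict[str, list[str]] = defaultdict(list)
    # A response without blocked_urls is treated as "nothing was rejected".
    for entry in response.get("blocked_urls", []):
        grouped[entry["reason"]].append(entry["url"])
    return dict(grouped)
```

For the response above, this yields one URL under private_ip and one under invalid_protocol. The sketch deliberately does not use the index field, because this page does not state whether it is zero-based.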

GET /v1/async/collections

Lists collections, with optional filters.

# All collections
curl 'https://api.scrapingpros.com/v1/async/collections' \
-H 'Authorization: Bearer <API-KEY>'

# Exact-name match
curl 'https://api.scrapingpros.com/v1/async/collections?name=daily-2026-04-30' \
-H 'Authorization: Bearer <API-KEY>'

# All collections starting with `daily-`
curl 'https://api.scrapingpros.com/v1/async/collections?name_prefix=daily-' \
-H 'Authorization: Bearer <API-KEY>'

# Created at or after a timestamp (ISO 8601, UTC)
curl 'https://api.scrapingpros.com/v1/async/collections?since=2026-04-30T11:00:00Z' \
-H 'Authorization: Bearer <API-KEY>'

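Filters can be combined; a collection is returned only if it matches all of them (logical AND). This is useful for recovery flows: for example, after a submit timeout, look up the collection you just created by its name and a since timestamp one minute in the past.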
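Combining filters amounts to adding several query parameters to the same URL. The sketch below builds such a URL with the standard library; the helper name build_list_url is hypothetical, and only the three filters shown above (name, name_prefix, since) are assumed to exist.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.scrapingpros.com/v1/async/collections"


def build_list_url(
    name: Optional[str] = None,
    name_prefix: Optional[str] = None,
    since: Optional[str] = None,
) -> str:
    """Build the list URL, including only the filters that were provided."""
    filters = {"name": name, "name_prefix": name_prefix, "since": since}
    # urlencode percent-encodes reserved characters such as ":" in timestamps.
    query = urlencode({key: value for key, value in filters.items() if value is not None})
    return f"{BASE_URL}?{query}" if query else BASE_URL
```

For example, build_list_url(name_prefix="daily-", since="2026-04-30T11:00:00Z") produces a URL that requests collections whose name starts with daily- and that were created at or after that timestamp.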

Example response:

[
{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "new collection",
"created_at": 1777853200.5,
"updated_at": 1777853200.5
},
{
"id": "11d6f8af-9a54-4b6c-b793-e12b77c86159",
"name": "new collection 2",
"created_at": 1777851000.1,
"updated_at": 1777851500.3
}
]

Timestamps are epoch seconds (UTC). A timestamp is null for collections created before the field was tracked.
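Because created_at and updated_at can be either a number or null, client code should handle both cases. A minimal sketch using the standard library; the helper name parse_timestamp is hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_timestamp(value: Optional[float]) -> Optional[datetime]:
    """Convert epoch seconds to a timezone-aware UTC datetime, passing None through."""
    if value is None:
        # Collections created before the field was tracked report null.
        return None
    return datetime.fromtimestamp(value, tz=timezone.utc)
```

Passing tz=timezone.utc keeps the result independent of the local timezone of the machine running the code.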

GET /v1/async/collections/{collection_id}

Retrieves the collection whose ID matches the one provided as a parameter.

Example response:

{
"id": "c38b0bcf-cb7c-4728-8704-2c2e267dcff9",
"name": "new collection",
"created_at": 1777853200.5,
"updated_at": 1777853200.5
}

PUT /v1/async/collections/{collection_id}

Updates the collection whose ID matches the one provided as a parameter.

Both the name and the request list can be updated.

Returns a JSON object with the updated collection.

This request requires a body with the following structure:

{
"name": "Updated collection",
"requests": [
{
"url": "https://example.com",
"browser": true,
"screenshot": false,
"actions": [
{
"type": "wait-for-timeout",
"time": 5000
}
]
}
]
}
  • name: Name of the project or collection.
  • requests: A list of requests. The parameters for this field are detailed in scrape.

Example response:

{
"id": "44a8c93d-a35e-4351-9565-7cd93b5ac296",
"name": "Updated collection",
"message": "Collection updated successfully."
}