Viability Test Flow
The viability test allows you to evaluate a set of URLs before scraping them, to automatically determine which scraping strategy to use for each one.
The complete flow has two steps:
- Submit the URLs for analysis and obtain a run_id.
- Query the run_id until the analysis is complete and read the recommended strategy.
1. Submit the URLs for analysis
Send the list of URLs you want to evaluate. The response is immediate and returns a run_id for tracking.
Endpoint
POST /v1/async/viability-test
Request Body Example
{
  "urls": [
    "https://www.example-static.com",
    "https://www.booking.com",
    "https://www.example-captcha.com"
  ]
}
Expected Response
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "in_progress",
  "total_urls": 3,
  "completed_urls": 0
}
Save the run_id -- it is needed to query the results in the next step.
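The submission step can be sketched with Python's standard library alone. The base URL `https://api.example.com` is a placeholder, and the sketch sends no authentication header; neither the host nor an auth scheme is given here, so both are assumptions to adjust for your deployment. The function names are illustrative.

```python
import json
import urllib.request

# Assumption: placeholder host; the real one is not stated in this guide.
BASE_URL = "https://api.example.com"


def build_submit_request(urls, base_url=BASE_URL):
    """Build (but do not send) the POST request that starts a viability test."""
    body = json.dumps({"urls": list(urls)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/async/viability-test",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_urls(urls, base_url=BASE_URL):
    """Send the request and return the run_id from the JSON response."""
    request = build_submit_request(urls, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["run_id"]
```

Keeping request construction separate from sending makes it possible to inspect or log the request without touching the network.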
2. Query the results
The analysis runs in the background. You need to poll periodically until status is completed.
Endpoint
GET /v1/async/viability-test/{run_id}
Response while analysis is in progress
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "in_progress",
  "total_urls": 3,
  "completed_urls": 1,
  "results": null
}
Response when analysis is complete
{
  "run_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "status": "completed",
  "total_urls": 3,
  "completed_urls": 3,
  "results": [
    {
      "url": "https://www.example-static.com",
      "recommended_strategy": "extract_html",
      ...
    },
    {
      "url": "https://www.booking.com",
      "recommended_strategy": "browser",
      ...
    },
    {
      "url": "https://www.example-captcha.com",
      "recommended_strategy": "blocked",
      ...
    }
  ]
}
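The polling step can be sketched as a loop with a fixed interval and an upper bound on attempts. `fetch_status` is a hypothetical callable that performs `GET /v1/async/viability-test/{run_id}` and returns the parsed JSON body; the interval and attempt limit are arbitrary defaults, not values recommended by the API. Only the `in_progress` and `completed` statuses appear in this guide, so the sketch treats anything other than `completed` as "keep waiting".

```python
import time


def wait_for_results(fetch_status, run_id, interval_seconds=2.0,
                     max_attempts=30, sleep=time.sleep):
    """Poll until status is "completed", then return the results list."""
    for attempt in range(max_attempts):
        payload = fetch_status(run_id)
        if payload["status"] == "completed":
            return payload["results"]
        # Still running: pause before the next poll (no pause after the last one).
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    raise TimeoutError(
        f"run {run_id} did not complete after {max_attempts} polls"
    )
```

Passing `fetch_status` and `sleep` as parameters keeps the loop independent of any particular HTTP client and lets it be exercised without real delays.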
3. Interpret the results and scrape
Each URL in results includes a recommended_strategy. Below is a description of what each one means and how to proceed with scraping.
extract_html -- Static HTML available
The site serves its content directly in the HTTP response, without requiring JavaScript. No blocks were detected.
What to do: scrape without a browser. This is the fastest option and consumes the fewest resources.
{
  "url": "https://www.example-static.com",
  "recommended_strategy": "extract_html",
  "javascript_required": false,
  "captcha_detected": false,
  "cloudflare_level": "none",
  "can_use_extract_html": true,
  "can_use_browser": true
}
browser -- The site requires JavaScript
The main content is rendered via JavaScript. A simple HTTP request would return empty or incomplete HTML. The site does not present active blocks.
What to do: scrape with the browser enabled. The browser executes the JavaScript and waits for the content to become available.
{
  "url": "https://www.booking.com",
  "recommended_strategy": "browser",
  "javascript_required": true,
  "browser_confidence": 0.92,
  "captcha_detected": false,
  "can_use_extract_html": false,
  "can_use_browser": true
}
api -- Data endpoints detected
During the browser analysis, JSON or XHR endpoints were detected that expose the data directly, without requiring authentication. The endpoints are listed in api_endpoints.
What to do: scrape without a browser, requesting the endpoints listed in api_endpoints directly. This is the most efficient strategy when available.
{
  "url": "https://www.example-booking.com",
  "recommended_strategy": "api",
  "api_detected": true,
  "api_endpoints": [
    "https://www.example-booking.com/avl",
    "https://www.example-booking.com/api/search"
  ],
  "api_auth_required": false,
  "can_use_simple_request": true
}
blocked -- The site cannot be scraped under normal conditions
One or more active barriers were detected: captcha, Cloudflare challenge, or login wall. Scraping is not possible without additional intervention.
What to do: review the captcha_providers, cloudflare_level, and login_wall fields to understand which specific barrier was encountered. This may require the use of residential proxies, captcha solving, or manual analysis of the authentication flow.
{
  "url": "https://www.example-captcha.com",
  "recommended_strategy": "blocked",
  "captcha_detected": true,
  "captcha_providers": ["cloudflare"],
  "cloudflare_level": "challenge",
  "can_use_extract_html": false,
  "can_use_browser": false
}
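Reviewing a blocked result can be sketched as a small helper that lists the barriers found. The function name is illustrative. The sketch assumes that `login_wall` is a truthy value when a login wall is present and that `"none"` is the only `cloudflare_level` value meaning "no challenge"; neither detail is confirmed here.

```python
def describe_barriers(result):
    """Return human-readable descriptions of the barriers in a blocked result."""
    barriers = []
    if result.get("captcha_detected"):
        providers = result.get("captcha_providers") or ["unknown provider"]
        barriers.append("captcha: " + ", ".join(providers))
    # Assumption: a missing or null cloudflare_level means no challenge.
    level = result.get("cloudflare_level") or "none"
    if level != "none":
        barriers.append(f"cloudflare: {level}")
    # Assumption: login_wall is truthy when a login wall was detected.
    if result.get("login_wall"):
        barriers.append("login wall")
    return barriers
```

For the example result above, this yields one entry for the captcha provider and one for the Cloudflare challenge.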
Summary
The viability test flow always follows the same two-step pattern:
- Submit the URLs -> obtain run_id.
- Query the run_id -> read recommended_strategy per URL and scrape accordingly.
| Strategy | Needs browser | When |
|---|---|---|
| extract_html | No | Static HTML, no JS or blocks |
| browser | Yes | Content rendered with JS |
| api | No | JSON/XHR endpoints detected and accessible |
| blocked | -- | Active captcha, Cloudflare challenge, or login wall |
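The strategy table can be turned into a small dispatch function. This is a sketch only: `plan_scrape` and the shape of the plan it returns are hypothetical and not part of the API, and a blocked URL is represented as `None` to signal that it needs manual review instead of scraping.

```python
def plan_scrape(result):
    """Map one entry of results to a scraping plan, or None when blocked."""
    strategy = result["recommended_strategy"]
    if strategy == "extract_html":
        # Static HTML: a plain HTTP request to the page itself is enough.
        return {"use_browser": False, "targets": [result["url"]]}
    if strategy == "browser":
        # Content is rendered by JavaScript: load the page in a browser.
        return {"use_browser": True, "targets": [result["url"]]}
    if strategy == "api":
        # Data endpoints were detected: request them directly.
        return {"use_browser": False,
                "targets": list(result.get("api_endpoints", []))}
    if strategy == "blocked":
        return None
    raise ValueError(f"unknown recommended_strategy: {strategy!r}")
```

Raising on an unrecognized strategy, instead of silently defaulting to one, makes it obvious if the service ever returns a value not covered in this guide.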