Use the Batch API — it's the recommended path for anything more than a handful of URLs. scrape() is for one-off requests: debugging, webhooks that need an immediate answer, or single-URL health checks.
Scraping
Output formats
HTML (default)
result = client.scrape("https://example.com")
print(result.html)
Markdown (recommended for AI/LLM)
result = client.scrape("https://example.com", format="markdown")
print(result.markdown)
Convenience property
result.content returns whichever is available — markdown if present, otherwise HTML:
result = client.scrape("https://example.com", format="markdown")
print(result.content) # markdown
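The fallback described above can be pictured with a small stand-in class. This is only a sketch of the documented behaviour (markdown if present, otherwise HTML); `FakeResult` is a hypothetical name and is not the SDK's own class or implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Hypothetical stand-in for a scrape result; not the SDK's class."""
    html: Optional[str] = None
    markdown: Optional[str] = None

    @property
    def content(self) -> Optional[str]:
        # Prefer markdown when it is present, otherwise fall back to HTML.
        return self.markdown if self.markdown is not None else self.html


md_result = FakeResult(markdown="# Example Domain")
html_result = FakeResult(html="<h1>Example Domain</h1>")
print(md_result.content)    # the markdown string
print(html_result.content)  # the HTML string
```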
Downloading files (PDFs, images, binaries)
Since v0.6.0, client.scrape() also handles non-HTML responses — PDFs, images, ZIPs, and any other binary content. The response exposes three new fields (content_type, body_base64, body_url) and five helpers modelled on the requests library, so downloading a file is a one-liner:
resp = client.scrape("https://investors.example.com/quarterly-report.pdf", browser=False)
resp.save("quarterly-report.pdf")
Under the hood, resp.is_binary is True, resp.content_type is "application/pdf", and resp.body returns the decoded bytes. Use the is_binary flag to branch before reading text accessors:
resp = client.scrape(url)
if resp.is_binary:
    # PDF, image, ZIP, etc.: text accessors return None
    resp.save(f"out.{resp.content_type.split('/')[-1]}")
else:
    print(resp.content)  # markdown / html
print(resp.statusCode)
Available accessors
| Accessor | Returns | When to use |
|---|---|---|
| resp.is_binary | bool | Branch on binary vs text |
| resp.content_type | str \| None | MIME of the response ("application/pdf", "text/html; charset=utf-8", …) |
| resp.body | bytes | Always-bytes accessor (text gets UTF-8 encoded) |
| resp.text | str \| None | None for binary; safe text-only accessor |
| resp.save(path) | int | Write to disk, returns bytes written |
| resp.content | str \| None | Legacy text-only convenience; None for binary |
| resp.body_base64 | str \| None | Wire format; almost always use body instead |
| resp.body_url | str \| None | Reserved for future blob offload (>5 MB) |
| await resp.download_body() | bytes | Auto-detects inline vs offloaded |
| resp.download_body_sync() | bytes | Sync version of above |
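To make the relationship between `body_base64`, `text` and `body` concrete, here is a minimal sketch of an "always bytes" accessor. It assumes only what the table states (binary content arrives base64-encoded, text is UTF-8 encoded on demand); `decode_body` is a hypothetical helper, not part of the SDK.

```python
import base64
from typing import Optional


def decode_body(body_base64: Optional[str], text: Optional[str]) -> bytes:
    """Return bytes for either a binary or a text response (illustrative only)."""
    if body_base64 is not None:
        # Binary payloads travel as base64 on the wire; decode them to raw bytes.
        return base64.b64decode(body_base64)
    # Text responses are UTF-8 encoded so the caller always receives bytes.
    return (text or "").encode("utf-8")


# Simulate the wire format of a tiny "PDF" and a text page.
wire = base64.b64encode(b"%PDF-1.7\n").decode("ascii")
pdf_bytes = decode_body(wire, None)
html_bytes = decode_body(None, "<p>café</p>")
```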
Common patterns
Download every PDF a site lists:
import os
# catalog_pdf_urls: your own list of PDF links
for pdf_url in catalog_pdf_urls:
    resp = client.scrape(pdf_url, browser=False)
    if resp.is_binary and resp.content_type == "application/pdf":
        resp.save(os.path.basename(pdf_url))
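One caveat with `os.path.basename(pdf_url)`: if the URL carries a query string, it ends up in the filename (for example `report.pdf?dl=1`). A possible workaround, sketched here with the standard library only, is to take the basename of the URL path instead. `filename_from_url` and its `default` argument are hypothetical names introduced for this example.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Derive a local filename from the URL path, ignoring query and fragment."""
    path = urlsplit(url).path
    # Decode percent-escapes, then keep only the last path segment.
    name = posixpath.basename(unquote(path))
    # URLs ending in "/" have no last segment, so fall back to the default.
    return name or default


print(filename_from_url("https://example.com/files/report.pdf?dl=1"))
```

In the loop above you could then call `resp.save(filename_from_url(pdf_url))`.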
Branch by MIME type:
ext = {"application/pdf": "pdf", "image/png": "png", "image/jpeg": "jpg"}
resp = client.scrape(url)
if resp.is_binary:
    suffix = ext.get(resp.content_type, "bin")
    resp.save(f"file.{suffix}")
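Content-Type values can carry parameters (the accessor table shows `"text/html; charset=utf-8"`) and may differ in case, so an exact dictionary lookup can miss. The sketch below normalises the value before the lookup. `extension_for` is a hypothetical helper, and whether the API ever returns parameters on binary types is an assumption.

```python
from typing import Optional

# Same lookup table as in the snippet above; extend it with the types you expect.
EXTENSIONS = {"application/pdf": "pdf", "image/png": "png", "image/jpeg": "jpg"}


def extension_for(content_type: Optional[str], default: str = "bin") -> str:
    """Map a Content-Type value to a file extension, ignoring parameters and case."""
    if not content_type:
        return default
    # Drop anything after ";" and normalise case: "Image/PNG; x=1" -> "image/png"
    mime = content_type.split(";", 1)[0].strip().lower()
    return EXTENSIONS.get(mime, default)


print(extension_for("application/pdf"))
print(extension_for("Image/JPEG; q=1"))
```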
Mix text and binary in a batch:
batch = client.submit_batch("daily", urls)
for r in batch.iter_results():
    if not r.guidance.success:
        continue
    if r.is_binary:
        save_blob(r.custom_id, r.body)
    else:
        save_html(r.custom_id, r.content)
scrape_many() and batch_scrape() work the same way — every yielded / returned ScrapeResponse exposes is_binary and friends.
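The batch loop above calls `save_blob` and `save_html` without defining them; they are your own functions, not SDK helpers. One possible shape, assuming you want one file per `custom_id` and adding a hypothetical `out_dir` argument that defaults to the current directory:

```python
import tempfile
from pathlib import Path


def save_blob(custom_id: str, data: bytes, out_dir: Path = Path(".")) -> Path:
    """Write binary content to <out_dir>/<custom_id>.bin and return the path."""
    path = out_dir / f"{custom_id}.bin"
    path.write_bytes(data)
    return path


def save_html(custom_id: str, text: str, out_dir: Path = Path(".")) -> Path:
    """Write text content to <out_dir>/<custom_id>.html as UTF-8 and return the path."""
    path = out_dir / f"{custom_id}.html"
    path.write_text(text, encoding="utf-8")
    return path


# Demo with a temporary directory and made-up IDs.
out_dir = Path(tempfile.mkdtemp())
blob_path = save_blob("report-1", b"%PDF-1.7", out_dir)
html_path = save_html("page-1", "<h1>Hello</h1>", out_dir)
```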
Browser rendering
For JavaScript-heavy sites (SPAs, React, Next.js), enable browser rendering:
result = client.scrape("https://spa-app.com", browser=True)
Automatic engine selection
When you pass browser=True, the API selects the best engine for each target domain automatically. You don't need to configure which browser to use — just ask for browser rendering and let the server route the request.
For harder sites (Google, Amazon, e-commerce with anti-bot), combine with retry_on_block and resource blocking to improve success rates:
# Standard — most JS sites
result = client.scrape("https://example.com", browser=True)
# Hard sites — retry on block + resource blocking
result = client.scrape(
    "https://www.google.com/shopping/...",
    browser=True,
    retry_on_block=True,
    block_resources=["image", "font", "media"],
)
Proxy rotation
Proxy rotation is on by default (use_proxy="any"). Every request goes through a different IP.
# Default: automatic proxy rotation
result = client.scrape("https://example.com")
# Disable proxy
result = client.scrape("https://example.com", use_proxy=None)
# Country-specific proxy (requires approval)
result = client.scrape("https://example.com", use_proxy="US")
Screenshots
Capture a full-page screenshot (requires browser=True):
import base64
result = client.scrape("https://example.com", browser=True, screenshot=True)
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
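Before writing the file, it can be worth checking that the decoded data really is a PNG, since the field is documented as a base64 PNG string. The sketch below checks the fixed eight-byte PNG signature. `decode_png` is a hypothetical helper, and the sample input is fabricated for the demo rather than taken from a real screenshot.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def decode_png(screenshot_b64: str) -> bytes:
    """Decode a base64 screenshot and verify it starts with the PNG signature."""
    data = base64.b64decode(screenshot_b64)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded screenshot is not a PNG")
    return data


# Fabricated sample: a valid signature followed by placeholder bytes.
sample = base64.b64encode(PNG_SIGNATURE + b"rest-of-file").decode("ascii")
png_bytes = decode_png(sample)
```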
Custom headers and cookies
result = client.scrape(
    "https://example.com",
    headers={"Accept-Language": "es-AR"},
    cookies={"session": "abc123"},
    language="es-AR",
)
POST requests
from scrapingpros import MethodPOST
result = client.scrape(
    "https://api.example.com/data",
    http_method=MethodPOST(payload={"query": "test"}),
)
POST to a different URL than the navigation target
Some sites require navigating to one page (to set cookies / generate a session) and POSTing to a different endpoint (an internal API or GraphQL). Set MethodPOST.url:
result = client.scrape(
    "https://www.example.com/dashboard",  # navigation target
    http_method=MethodPOST(
        url="https://api.example.com/graphql",  # POST goes here
        payload={"query": "..."},
    ),
)
If MethodPOST.url is omitted, the POST goes to the same URL as the scrape (default behavior).
Form-encoded POST (OAuth2, legacy APIs)
OAuth2 grant_type=client_credentials and most legacy form-based APIs require application/x-www-form-urlencoded request bodies, not JSON. Set content_type="form":
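import base64
from scrapingpros import MethodPOST
# Placeholder credentials; HTTP Basic auth is base64("client_id:client_secret")
base64_creds = base64.b64encode(b"my-client-id:my-client-secret").decode("ascii")
resp = client.scrape(
    "https://api.example.com/v1/oauth2/token",
    http_method=MethodPOST(
        payload={"grant_type": "client_credentials", "scope": "read"},
        content_type="form",  # default is "json"
    ),
    headers={"Authorization": f"Basic {base64_creds}"},
)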
Accepted values: "json" (default), "form". Available since v0.5.0.
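To see what the two modes mean on the wire, the sketch below encodes the same payload both ways with the standard library. It illustrates the two standard body formats only; `encode_payload` is a hypothetical helper, and how the SDK or API builds the request internally is not shown here.

```python
import json
from typing import Dict, Tuple
from urllib.parse import urlencode


def encode_payload(payload: Dict[str, str], content_type: str = "json") -> Tuple[str, bytes]:
    """Return (Content-Type header, request body) for a payload. Illustrative only."""
    if content_type == "form":
        # key=value pairs joined with "&", percent-encoded
        return "application/x-www-form-urlencoded", urlencode(payload).encode("ascii")
    if content_type == "json":
        return "application/json", json.dumps(payload).encode("utf-8")
    raise ValueError(f"unsupported content_type: {content_type!r}")


payload = {"grant_type": "client_credentials", "scope": "read"}
form_header, form_body = encode_payload(payload, "form")
json_header, json_body = encode_payload(payload, "json")
print(form_body)  # b'grant_type=client_credentials&scope=read'
```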
Response fields
Every ScrapeResponse includes:
| Field | Type | Description |
|---|---|---|
| content | str | Convenience: markdown if available, else HTML |
| html | str | Raw HTML (when format="html") |
| markdown | str | Clean text (when format="markdown") |
| statusCode | int | HTTP status from target page |
| executionTime | float | Seconds |
| extracted_data | dict | Extracted data (see Data Extraction) |
| evaluate_results | list | JS evaluation results (see JavaScript Execution) |
| screenshot | str | Base64 PNG string |
| guidance | ScrapeGuidance | Error analysis and next steps (see Response Guidance) |
| network_requests | list | Captured network activity |
| potentiallyBlockedByCaptcha | bool | CAPTCHA detection flag |
| timings | dict | Performance breakdown |
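As an illustration of how these fields might be combined, the sketch below collects reasons a response could need a retry or a manual look. It uses only field names that appear in this document (`statusCode`, `potentiallyBlockedByCaptcha`, `guidance.success`). The `problems` helper is hypothetical, and the responses are simulated with `SimpleNamespace` rather than real `ScrapeResponse` objects.

```python
from types import SimpleNamespace
from typing import List


def problems(resp) -> List[str]:
    """Collect reasons a response may need a retry or manual review."""
    found = []
    if not resp.guidance.success:
        found.append("guidance reports failure")
    if resp.potentiallyBlockedByCaptcha:
        found.append("possible CAPTCHA block")
    if resp.statusCode >= 400:
        found.append(f"HTTP {resp.statusCode}")
    return found


# Simulated responses for the demo.
ok = SimpleNamespace(statusCode=200, potentiallyBlockedByCaptcha=False,
                     guidance=SimpleNamespace(success=True))
blocked = SimpleNamespace(statusCode=403, potentiallyBlockedByCaptcha=True,
                          guidance=SimpleNamespace(success=False))
print(problems(ok))
print(problems(blocked))
```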