Scraping many URLs?

Use the Batch API — it's the recommended path for anything more than a handful of URLs. scrape() is for one-off requests: debugging, webhooks that need an immediate answer, or single-URL health checks.

Scraping

Output formats

HTML (default)

result = client.scrape("https://example.com")
print(result.html)

Markdown (recommended for AI/LLM)

result = client.scrape("https://example.com", format="markdown")
print(result.markdown)

Convenience property

result.content returns whichever is available — markdown if present, otherwise HTML:

result = client.scrape("https://example.com", format="markdown")
print(result.content)  # markdown

Downloading files (PDFs, images, binaries)

Since v0.6.0, client.scrape() also handles non-HTML responses — PDFs, images, ZIPs, and any other binary content. The response exposes three new fields (content_type, body_base64, body_url) and five helpers modelled on the requests library, so downloading a file is a one-liner:

resp = client.scrape("https://investors.example.com/quarterly-report.pdf", browser=False)
resp.save("quarterly-report.pdf")

Under the hood, resp.is_binary is True, resp.content_type is "application/pdf", and resp.body returns the decoded bytes. Use the is_binary flag to branch before reading text accessors:

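resp = client.scrape(url)
if resp.is_binary:
    # PDF, image, ZIP, etc.: text accessors return None.
    # content_type may be None or carry parameters ("; charset=..."), so normalise it first.
    mime = (resp.content_type or "application/octet-stream").split(";")[0].strip()
    resp.save(f"out.{mime.split('/')[-1]}")
else:
    print(resp.content)         # markdown / html
    print(resp.statusCode)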

Available accessors

| Accessor | Returns | When to use |
| --- | --- | --- |
| `resp.is_binary` | `bool` | Branch on binary vs text |
| `resp.content_type` | `str \| None` | MIME type of the response (`"application/pdf"`, `"text/html; charset=utf-8"`, …) |
| `resp.body` | `bytes` | Always-bytes accessor (text gets UTF-8 encoded) |
| `resp.text` | `str \| None` | `None` for binary; safe text-only accessor |
| `resp.save(path)` | `int` | Write to disk; returns bytes written |
| `resp.content` | `str \| None` | Legacy text-only convenience; `None` for binary |
| `resp.body_base64` | `str \| None` | Wire format; almost always use `body` instead |
| `resp.body_url` | `str \| None` | Reserved for future blob offload (>5 MB) |
| `await resp.download_body()` | `bytes` | Auto-detects inline vs offloaded |
| `resp.download_body_sync()` | `bytes` | Sync version of the above |
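
The relationship between these accessors can be illustrated with a small stand-in class. This is only a sketch of the semantics the table describes, not the SDK's implementation: `FakeResponse` and its constructor fields are hypothetical, and only the accessor names mirror the table.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResponse:
    """Hypothetical stand-in that mimics the accessor semantics in the table."""
    content_type: Optional[str] = None
    body_base64: Optional[str] = None   # wire format for binary payloads
    text: Optional[str] = None          # None when the payload is binary

    @property
    def is_binary(self) -> bool:
        # Assumption for this sketch: binary payloads arrive base64-encoded.
        return self.body_base64 is not None

    @property
    def body(self) -> bytes:
        # Always bytes: decode base64 for binary, UTF-8 encode for text.
        if self.body_base64 is not None:
            return base64.b64decode(self.body_base64)
        return (self.text or "").encode("utf-8")

    def save(self, path: str) -> int:
        # Write the bytes to disk and return how many were written.
        with open(path, "wb") as f:
            return f.write(self.body)


pdf = FakeResponse("application/pdf", body_base64=base64.b64encode(b"%PDF-1.7").decode())
page = FakeResponse("text/html; charset=utf-8", text="<h1>hi</h1>")
```

Here `pdf.body` yields the decoded bytes while `pdf.text` stays `None`, and `page.body` is the UTF-8 encoding of `page.text`, which is why `body` is the accessor that works for both kinds of response.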

Common patterns

Download every PDF a site lists:

import os

for pdf_url in catalog_pdf_urls:
    resp = client.scrape(pdf_url, browser=False)
    if resp.is_binary and resp.content_type == "application/pdf":
        resp.save(os.path.basename(pdf_url))
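
`os.path.basename` on a raw URL keeps any query string or fragment in the file name. A minimal sketch of a more defensive alternative, using only the standard library; `filename_from_url` and its `default` argument are hypothetical helpers, not part of the SDK:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Return the last path segment of a URL, ignoring query and fragment."""
    path = urlsplit(url).path
    # Decode percent-escapes, then keep only the final path component.
    name = posixpath.basename(unquote(path))
    return name or default
```

In the loop above, `resp.save(filename_from_url(pdf_url))` would then write `report.pdf` for a URL such as `https://example.com/files/report.pdf?v=2`.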

Branch by MIME type:

ext = {"application/pdf": "pdf", "image/png": "png", "image/jpeg": "jpg"}
resp = client.scrape(url)
if resp.is_binary:
    suffix = ext.get(resp.content_type, "bin")
    resp.save(f"file.{suffix}")
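
A hand-written mapping only covers the types you list, and an exact lookup misses values that carry parameters. As a rough alternative, the standard library's `mimetypes` module can derive the extension. `extension_for` is a hypothetical helper, and the `.bin` fallback is an assumption carried over from the snippet above:

```python
import mimetypes
from typing import Optional


def extension_for(content_type: Optional[str], default: str = ".bin") -> str:
    """Map a Content-Type value to a file extension (with leading dot)."""
    if not content_type:
        return default
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime) or default
```

With this helper the save call becomes `resp.save(f"file{extension_for(resp.content_type)}")`. Note that the extension chosen for some types can vary between Python versions and platforms.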

Mix text and binary in a batch:

batch = client.submit_batch("daily", urls)
for r in batch.iter_results():
    if not r.guidance.success:
        continue
    if r.is_binary:
        save_blob(r.custom_id, r.body)
    else:
        save_html(r.custom_id, r.content)

scrape_many() and batch_scrape() work the same way — every yielded / returned ScrapeResponse exposes is_binary and friends.
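
The batch example calls `save_blob` and `save_html` without defining them. One possible shape for those helpers is sketched below. The output directory, the file suffixes, and the sanitising rule for `custom_id` are all assumptions made for illustration:

```python
from pathlib import Path

OUT_DIR = Path("batch_output")  # assumed output directory


def _safe_name(custom_id: str) -> str:
    # Replace anything that is not alphanumeric, "-", "_" or "." with "_".
    return "".join(c if c.isalnum() or c in "-_." else "_" for c in custom_id)


def save_blob(custom_id: str, data: bytes, out_dir: Path = OUT_DIR) -> Path:
    """Write binary content to <out_dir>/<custom_id>.bin and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{_safe_name(custom_id)}.bin"
    path.write_bytes(data)
    return path


def save_html(custom_id: str, text: str, out_dir: Path = OUT_DIR) -> Path:
    """Write text content to <out_dir>/<custom_id>.html and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{_safe_name(custom_id)}.html"
    path.write_text(text, encoding="utf-8")
    return path
```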

Browser rendering

For JavaScript-heavy sites (SPAs, React, Next.js), enable browser rendering:

result = client.scrape("https://spa-app.com", browser=True)

Automatic engine selection

When you pass browser=True, the API selects the best engine for each target domain automatically. You don't need to configure which browser to use — just ask for browser rendering and let the server route the request.

For harder sites (Google, Amazon, e-commerce with anti-bot), combine with retry_on_block and resource blocking to improve success rates:

# Standard — most JS sites
result = client.scrape("https://example.com", browser=True)

# Hard sites — retry on block + resource blocking
result = client.scrape(
    "https://www.google.com/shopping/...",
    browser=True,
    retry_on_block=True,
    block_resources=["image", "font", "media"],
)

Proxy rotation

Proxy rotation is on by default (use_proxy="any"). Every request goes through a different IP.

# Default: automatic proxy rotation
result = client.scrape("https://example.com")

# Disable proxy
result = client.scrape("https://example.com", use_proxy=None)

# Country-specific proxy (requires approval)
result = client.scrape("https://example.com", use_proxy="US")

Screenshots

Capture a full-page screenshot (requires browser=True):

import base64

result = client.scrape("https://example.com", browser=True, screenshot=True)

# result.screenshot is a base64-encoded PNG string
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.screenshot))
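
If you want to confirm that the decoded data really is a PNG before writing it, you can check the 8-byte PNG signature. This is an optional sanity check sketched with the standard library; `decode_png` is a hypothetical helper, not an SDK function:

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_png(b64_png: str) -> bytes:
    """Decode a base64 string and check that it starts with the PNG signature."""
    data = base64.b64decode(b64_png)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data
```

Writing the file then becomes `f.write(decode_png(result.screenshot))`, which raises `ValueError` instead of silently saving a corrupt image.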

Custom headers and cookies

result = client.scrape(
    "https://example.com",
    headers={"Accept-Language": "es-AR"},
    cookies={"session": "abc123"},
    language="es-AR",
)

POST requests

from scrapingpros import MethodPOST

result = client.scrape(
    "https://api.example.com/data",
    http_method=MethodPOST(payload={"query": "test"}),
)

POST to a different URL than the navigation target

Some sites require navigating to one page (to set cookies or generate a session) and POSTing to a different endpoint (an internal API or GraphQL). Set MethodPOST.url to the endpoint that should receive the POST:

result = client.scrape(
    "https://www.example.com/dashboard",       # navigation target
    http_method=MethodPOST(
        url="https://api.example.com/graphql", # POST goes here
        payload={"query": "..."},
    ),
)

If MethodPOST.url is omitted, the POST goes to the same URL as the scrape (default behavior).

Form-encoded POST (OAuth2, legacy APIs)

OAuth2 grant_type=client_credentials and most legacy form-based APIs require application/x-www-form-urlencoded request bodies, not JSON. Set content_type="form":

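import base64

from scrapingpros import MethodPOST

# HTTP Basic credentials: base64 of "client_id:client_secret" (placeholder values)
base64_creds = base64.b64encode(b"client_id:client_secret").decode()

resp = client.scrape(
    "https://api.example.com/v1/oauth2/token",
    http_method=MethodPOST(
        payload={"grant_type": "client_credentials", "scope": "read"},
        content_type="form",   # default is "json"
    ),
    headers={"Authorization": f"Basic {base64_creds}"},
)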

Accepted values: "json" (default), "form". Available since v0.5.0.
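
To make the difference concrete, the snippet below encodes the same payload both ways using only the standard library. It shows how JSON and form bodies differ in general. That the API serialises the payload in exactly this way is an assumption.

```python
import json
from urllib.parse import urlencode

payload = {"grant_type": "client_credentials", "scope": "read"}

# content_type="json": the body is a JSON document
# (sent with Content-Type: application/json)
json_body = json.dumps(payload)

# content_type="form": the body is URL-encoded key=value pairs
# (sent with Content-Type: application/x-www-form-urlencoded)
form_body = urlencode(payload)

print(json_body)
print(form_body)
```

A token endpoint that expects form encoding will typically reject the JSON variant because it cannot find the `grant_type` parameter in the body.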

Response fields

Every ScrapeResponse includes:

| Field | Type | Description |
| --- | --- | --- |
| `content` | `str \| None` | Convenience: markdown if available, else HTML; `None` for binary responses |
| `html` | `str` | Raw HTML (when `format="html"`) |
| `markdown` | `str` | Clean text (when `format="markdown"`) |
| `statusCode` | `int` | HTTP status from the target page |
| `executionTime` | `float` | Execution time in seconds |
| `extracted_data` | `dict` | Extracted data (see Data Extraction) |
| `evaluate_results` | `list` | JS evaluation results (see JavaScript Execution) |
| `screenshot` | `str` | Base64-encoded PNG string |
| `guidance` | `ScrapeGuidance` | Error analysis and next steps (see Response Guidance) |
| `network_requests` | `list` | Captured network activity |
| `potentiallyBlockedByCaptcha` | `bool` | CAPTCHA detection flag |
| `timings` | `dict` | Performance breakdown |
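
As an illustration of how these fields might be combined, the sketch below flags responses that look blocked or failed. The retry policy (CAPTCHA flag, 403, 429, and 5xx) is an assumption chosen for the example, `needs_retry` is a hypothetical helper, and the `SimpleNamespace` objects stand in for real `ScrapeResponse` instances:

```python
from types import SimpleNamespace


def needs_retry(resp) -> bool:
    """Heuristic: True when the response looks blocked or failed."""
    if getattr(resp, "potentiallyBlockedByCaptcha", False):
        return True
    status = getattr(resp, "statusCode", None)
    # Assumed policy: a missing status, 403/429, and 5xx are retry candidates.
    return status is None or status in (403, 429) or status >= 500


# Stand-in objects carrying only the two fields the helper reads.
ok = SimpleNamespace(statusCode=200, potentiallyBlockedByCaptcha=False)
captcha = SimpleNamespace(statusCode=200, potentiallyBlockedByCaptcha=True)
throttled = SimpleNamespace(statusCode=429, potentiallyBlockedByCaptcha=False)
```

For real decisions, the `guidance` field is the documented source of error analysis; a status-code heuristic like this one is only a rough first pass.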
