Why Python Developers Replace Selenium and Playwright with a Capture API
The standard Python stack for programmatic webpage capture is Selenium or Playwright driving a local headless Chromium. That works in a Jupyter notebook and falls apart the moment you try to ship it. Chromium adds 300+ MB to every Docker image, which pushes zip-packaged Lambda functions past the 250 MB unzipped limit and slows ECS deployments, because every new task has to pull the larger image. Each Chromium instance holds 400 to 600 MB of resident memory, so a FastAPI worker handling a dozen concurrent capture requests needs roughly 5 to 7 GB of RAM dedicated purely to browsers. And the crashes (OOM kills, zombie renderer processes, X11 library errors in minimal Linux images) all need watchdog logic that has nothing to do with your business problem. A hosted webpage capture API replaces all of that with a single HTTPS call: pass the URL, receive the PNG or PDF bytes, move on.
Minimal Requests Example
import os, requests

API_KEY = os.environ["SNAPAPI_KEY"]

def capture(url, format="png", full_page=True, width=1280, height=800):
    params = {
        "access_key": API_KEY,
        "url": url,
        "format": format,
        "full_page": "1" if full_page else "0",
        "viewport_width": width,
        "viewport_height": height,
    }
    r = requests.get("https://snapapi.pics/screenshot", params=params, timeout=30)
    r.raise_for_status()
    return r.content

png = capture("https://example.com")
with open("out.png", "wb") as f:
    f.write(png)

pdf = capture("https://example.com", format="pdf")
with open("out.pdf", "wb") as f:
    f.write(pdf)
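Before writing the bytes to disk, it can be worth confirming that they are what you asked for. The sketch below checks the leading bytes against the standard PNG and PDF signatures. The helper name sniff_format is hypothetical, and whether the API can ever return a 200 with a non-image body is an assumption, not something documented here.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte signature at the start of every PNG file
PDF_MAGIC = b"%PDF-"              # marker at the start of a PDF file

def sniff_format(data: bytes):
    """Return "png", "pdf", or None based on the leading bytes (hypothetical helper)."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(PDF_MAGIC):
        return "pdf"
    return None
```

A caller could compare sniff_format(r.content) with the requested format and raise an error on a mismatch instead of saving an error page under a .png name.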
Async with httpx
For any pipeline doing more than a handful of captures, synchronous requests calls block the worker on network I/O for the full duration of each capture. httpx with asyncio lets one worker keep many captures in flight at once without spawning threads:
import os, asyncio, httpx

API_KEY = os.environ["SNAPAPI_KEY"]
BASE = "https://snapapi.pics/screenshot"

async def capture(client, url, format="png"):
    params = {"access_key": API_KEY, "url": url, "format": format, "full_page": "1"}
    r = await client.get(BASE, params=params, timeout=30.0)
    r.raise_for_status()
    return url, r.content

async def capture_many(urls, concurrency=8):
    sem = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        async def bounded(u):
            async with sem:
                return await capture(client, u)
        return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)

urls = ["https://example.com", "https://python.org", "https://fastapi.tiangolo.com"]
results = asyncio.run(capture_many(urls))
for r in results:
    if isinstance(r, Exception):
        print("err:", r)
    else:
        url, data = r
        print(url, len(data), "bytes")
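One gap in the loop above is that a failed capture prints the exception but not the URL it belongs to. Because asyncio.gather returns results in the same order as its inputs, the two lists can be zipped back together. The following is a minimal sketch of that idea; the helper name partition is hypothetical.

```python
def partition(urls, results):
    """Split gather(..., return_exceptions=True) output into successes and failures.

    Relies on gather preserving input order, so urls[i] corresponds to results[i].
    """
    ok, failed = {}, {}
    for url, result in zip(urls, results):
        if isinstance(result, BaseException):
            failed[url] = result      # keep the exception for logging or a retry pass
        else:
            _, data = result          # capture() returns a (url, bytes) tuple
            ok[url] = data
    return ok, failed
```

The failed mapping can then be fed into a second, slower pass or written to a dead-letter list, depending on how much the pipeline cares about completeness.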
FastAPI Download Endpoint
Wrap the API behind your own FastAPI route so the rest of your app doesn't need to know where the screenshot came from. Stream the response straight through to avoid buffering multi-megabyte full-page screenshots in memory:
import os, httpx
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse

app = FastAPI()
API_KEY = os.environ["SNAPAPI_KEY"]

@app.get("/capture")
async def capture(url: str, format: str = "png"):
    params = {"access_key": API_KEY, "url": url, "format": format, "full_page": "1"}
    client = httpx.AsyncClient(timeout=30.0)
    req = client.build_request("GET", "https://snapapi.pics/screenshot", params=params)
    try:
        upstream = await client.send(req, stream=True)
    except httpx.HTTPError:
        # The upstream call never produced a response; release the client.
        await client.aclose()
        raise HTTPException(status_code=502, detail="capture failed")
    if upstream.status_code != 200:
        await upstream.aclose()
        await client.aclose()
        raise HTTPException(status_code=502, detail="capture failed")
    mime = "application/pdf" if format == "pdf" else f"image/{format}"
    filename = f"capture.{format}"

    async def pipe():
        # finally runs even if the caller disconnects mid-stream,
        # so the upstream connection is always released.
        try:
            async for chunk in upstream.aiter_bytes():
                yield chunk
        finally:
            await upstream.aclose()
            await client.aclose()

    return StreamingResponse(
        pipe(),
        media_type=mime,
        headers={"Content-Disposition": f'attachment; filename="{filename}"'},
    )
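The route above interpolates the caller-supplied format straight into the media type and the Content-Disposition header. A small allow-list avoids that and rejects unknown formats before any upstream call is made. This is a sketch only: the names MEDIA_TYPES and resolve_media are hypothetical, and the table lists just the two formats used in this article, so extend it with whatever your API plan supports.

```python
# Hypothetical allow-list: only the formats shown in this article.
# Add entries (for example "jpeg": "image/jpeg") if your API supports them.
MEDIA_TYPES = {
    "png": "image/png",
    "pdf": "application/pdf",
}

def resolve_media(fmt: str):
    """Return (media_type, filename) for an allowed format, else raise ValueError."""
    key = fmt.strip().lower()
    if key not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return MEDIA_TYPES[key], f"capture.{key}"
```

Inside the route, a ValueError from resolve_media could be turned into HTTPException(status_code=400, detail="unsupported format") so that bad input surfaces as a client error rather than a 502.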
Celery Batch Jobs
For pipelines that process thousands of URLs (competitor monitoring, archival, QA snapshots), hand each capture to a Celery task and let the broker queue the work while the worker pool sets the concurrency:
import os, requests
from celery import Celery

app = Celery("captures", broker=os.environ["REDIS_URL"])
API_KEY = os.environ["SNAPAPI_KEY"]

# Retry on HTTP error statuses and on network failures, with exponential backoff.
@app.task(
    autoretry_for=(requests.HTTPError, requests.ConnectionError, requests.Timeout),
    retry_backoff=True,
    max_retries=3,
)
def capture_to_s3(url: str, bucket: str, key: str):
    import boto3
    params = {"access_key": API_KEY, "url": url, "format": "png", "full_page": "1"}
    r = requests.get("https://snapapi.pics/screenshot", params=params, timeout=30)
    r.raise_for_status()
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=r.content,
        ContentType="image/png",
        CacheControl="public, max-age=86400",
    )
    return f"s3://{bucket}/{key}"
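The task takes the S3 key as an argument, so the caller has to decide on a naming scheme. One option, sketched here as an assumption rather than a recommendation from the API, is a date prefix plus a hash of the URL. That keeps keys deterministic, so a retried or re-queued job overwrites the same object instead of creating a duplicate. The helper key_for and the captures/ prefix are hypothetical.

```python
import hashlib
from datetime import date

def key_for(url: str, day: date, prefix: str = "captures") -> str:
    """Build a deterministic S3 key such as captures/2024-01-02/<hash>.png."""
    # A truncated SHA-256 keeps keys short while avoiding URL characters in paths.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}/{day.isoformat()}/{digest}.png"
```

With that in place, queuing a batch is a loop over capture_to_s3.delay(url, bucket, key_for(url, date.today())).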
Error Handling and Retries You Actually Need
The three failure modes worth writing code for are HTTP 429 (rate limited), 5xx (transient backend hiccups), and connection resets. Everything else — bad URLs, timeouts on the target site, auth errors — should surface to the caller immediately. The retry pattern that survives is exponential backoff with jitter, capped at three attempts, with the Retry-After header respected when the server returns 429:
import os, time, random, requests

def capture_with_retry(url, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            r = requests.get("https://snapapi.pics/screenshot", params={
                "access_key": os.environ["SNAPAPI_KEY"],
                "url": url, "format": "png", "full_page": "1",
            }, timeout=30)
            if r.status_code == 429:
                # Retry-After may be an HTTP date instead of seconds;
                # fall back to exponential backoff in that case.
                try:
                    wait = int(r.headers.get("Retry-After", 2 ** attempt))
                except ValueError:
                    wait = 2 ** attempt
                time.sleep(wait + random.random())
                continue
            if 500 <= r.status_code < 600:
                time.sleep((2 ** attempt) + random.random())
                continue
            # Any other error status (4xx) surfaces to the caller immediately.
            r.raise_for_status()
            return r.content
        except (requests.ConnectionError, requests.Timeout):
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) + random.random())
    raise RuntimeError("capture failed after retries")
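HTTP allows Retry-After to carry either a number of seconds or an HTTP date, and the 429 branch in the retry loop only reads the numeric form. If you want to honor both, a parser along these lines is one way to do it. The function name parse_retry_after and its now parameter (there to make it testable) are hypothetical, and which form this particular API sends is not something the article states.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_retry_after(value, default, now=None):
    """Return the number of seconds to wait for a Retry-After header value.

    Accepts delta-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"); falls back to `default`
    when the header is missing or cannot be parsed.
    """
    if value is None:
        return float(default)
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return float(default)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", never a negative sleep.
    return max(0.0, (when - now).total_seconds())
```

In the retry loop, the wait could then be computed as parse_retry_after(r.headers.get("Retry-After"), 2 ** attempt) before adding jitter.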
When to Use Hosted vs. Self-Hosted
Self-hosted Playwright still makes sense if you need full programmatic control of the browser — intercepting requests mid-flight, injecting cookies, or running end-to-end test suites. For webpage capture specifically — feeding pages through a pipeline, generating thumbnails, archiving content, producing PDF reports — a hosted API is nearly always cheaper once you factor in the ops cost of running Chromium fleets. SnapAPI handles the browser pool, stealth evasions, device emulation, ad and cookie blocking, and full-page rendering with sticky headers out of the box.
Start Capturing Webpages from Python in Under a Minute
SnapAPI's free tier gives you 200 captures per month — enough to prototype a pipeline, test the retry logic, and benchmark real network latency from your environment before committing. Grab a key at snapapi.pics/register and drop it into any of the examples above. No browser binaries, no Dockerfile surgery, no memory monitoring.