Web Scraping API for Python: Extract Data Without Browser Infrastructure

Replace Playwright and Selenium with a single API call. SnapAPI handles JavaScript rendering, anti-bot bypass, and proxy rotation so your Python scraper stays lean and focused on data extraction logic.

Python Web Scraping API Quickstart

Extracting data from a JavaScript-rendered page requires only the requests library. Pass the URL and a CSS selector, and SnapAPI returns the matched text after full browser rendering:

import requests
import os

SNAPAPI_KEY = os.environ["SNAPAPI_KEY"]
BASE_URL = "https://api.snapapi.pics/v1"

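def extract(url: str, selector: str, wait_for: str | None = None) -> str:
    """Return the text matched by `selector` after the page has rendered."""
    payload = {"url": url, "selector": selector}
    if wait_for:
        # Optional: wait for this CSS selector before extracting
        payload["wait_for"] = wait_for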
    response = requests.post(
        f"{BASE_URL}/extract",
        json=payload,
        headers={"X-Api-Key": SNAPAPI_KEY},
        timeout=30
    )
    response.raise_for_status()
    return response.json()["text"]

# Extract a product price from a JS-rendered page
price = extract(
    url="https://shop.example.com/product/widget",
    selector=".product-price",
    wait_for=".product-price"
)
print(f"Price: {price}")  # "$29.99"

The wait_for parameter instructs the API to wait until the CSS selector is visible in the DOM before extracting. This handles lazy-loaded content, infinite scroll pages, and React/Vue/Angular applications that populate data after the initial render.

Scraping Full Page HTML

For cases where you need the complete rendered HTML to parse with BeautifulSoup or lxml, use the scrape endpoint:

from bs4 import BeautifulSoup

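def scrape(url: str, wait_for: str | None = None) -> BeautifulSoup:
    """Fetch the rendered HTML for `url` and return it as a parsed soup."""
    payload = {"url": url}
    if wait_for:
        # Optional: wait for this CSS selector before capturing the HTML
        payload["wait_for"] = wait_for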
    response = requests.post(
        f"{BASE_URL}/scrape",
        json=payload,
        headers={"X-Api-Key": SNAPAPI_KEY},
        timeout=30
    )
    response.raise_for_status()
    html = response.json()["html"]
    return BeautifulSoup(html, "html.parser")

soup = scrape("https://news.ycombinator.com", wait_for=".athing")
titles = [a.get_text() for a in soup.select(".athing .titleline a")]
print(titles[:5])

The scrape endpoint returns the fully rendered HTML after all JavaScript has executed. Parse it with BeautifulSoup using the html.parser or lxml backend — the same workflow as scraping static pages, but with full JavaScript support handled remotely.

Async Python Scraping with httpx

For high-throughput scraping pipelines, use httpx with asyncio to fire many requests concurrently. This pattern processes hundreds of URLs in parallel while respecting concurrency limits:

import asyncio
import httpx

async def extract_async(client: httpx.AsyncClient, url: str, selector: str) -> dict:
    response = await client.post(
        "https://api.snapapi.pics/v1/extract",
        json={"url": url, "selector": selector, "wait_for": selector},
        headers={"X-Api-Key": SNAPAPI_KEY},
        timeout=30
    )
    return {"url": url, "text": response.json().get("text", "")}

async def scrape_all(urls: list[str], selector: str, concurrency: int = 5) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)
    async with httpx.AsyncClient() as client:
        async def bounded(url):
            async with semaphore:
                return await extract_async(client, url, selector)
        return await asyncio.gather(*[bounded(url) for url in urls])

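# Usage (product_urls is a placeholder list; substitute your own URLs)
product_urls = ["https://shop.example.com/product/widget"]
results = asyncio.run(scrape_all(product_urls, ".product-price", concurrency=5))
for r in results:
    print(r["url"], r["text"])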

The asyncio.Semaphore caps the number of concurrent API calls to 5, which is safe for the Starter plan. Increase to 20 for Pro accounts. asyncio.gather collects results in the order URLs were provided, even though requests complete out of order.
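
Both properties, the cap on in-flight calls and the input-ordered results, can be checked locally without calling the API. The sketch below replaces the HTTP request with a stand-in coroutine that sleeps; `run_bounded`, `demo`, and `fake_fetch` are illustrative names, not part of SnapAPI or its SDK.

```python
import asyncio

async def run_bounded(items, worker, concurrency=5):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(*(bounded(item) for item in items))

def demo(n=20, concurrency=5):
    state = {"in_flight": 0, "peak": 0}

    async def fake_fetch(i):
        # Stand-in for the HTTP call: later items sleep less, so they finish sooner
        state["in_flight"] += 1
        state["peak"] = max(state["peak"], state["in_flight"])
        await asyncio.sleep(0.001 * (n - i))
        state["in_flight"] -= 1
        return i

    results = asyncio.run(run_bounded(range(n), fake_fetch, concurrency))
    return results, state["peak"]

if __name__ == "__main__":
    results, peak = demo()
    print("ordered:", results == list(range(20)), "peak in flight:", peak)
```

Even though the stand-in tasks complete out of order, the returned list matches the input order, and the recorded peak never exceeds the semaphore limit.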

Pandas Integration for Data Pipelines

Scraped results feed naturally into pandas DataFrames for analysis, deduplication, and export:

import pandas as pd

results = asyncio.run(scrape_all(product_urls, ".product-price"))
df = pd.DataFrame(results)
df["price_numeric"] = pd.to_numeric(
    df["text"].str.replace(r"[^\d.]", "", regex=True),
    errors="coerce",  # empty or malformed strings become NaN instead of raising
)
df = df.dropna(subset=["price_numeric"])
df.to_csv("prices.csv", index=False)
print(df.describe())

Clean the extracted price strings with a regex that strips currency symbols and formatting, then cast to float for numerical analysis. Export to CSV, write to PostgreSQL with df.to_sql, or pass to a visualization library for price trend charts.
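
The same cleaning step can be sketched in plain Python for cases where a DataFrame is not needed. This is a minimal helper, assuming prices use a dot as the decimal separator; `parse_price` is a hypothetical name introduced here for illustration.

```python
import re
from typing import Optional

def parse_price(text: Optional[str]) -> Optional[float]:
    """Strip everything except digits and dots, then convert to float.

    Returns None when nothing numeric is left (for example, an empty
    string from a failed extraction, or a label such as "Sold out").
    """
    cleaned = re.sub(r"[^\d.]", "", text or "")
    try:
        return float(cleaned)
    except ValueError:
        return None

if __name__ == "__main__":
    for raw in ["$29.99", "$1,299.00", "", "Sold out"]:
        print(repr(raw), "->", parse_price(raw))
```

Returning None for unparseable input keeps bad rows easy to filter out. Prices written with a comma as the decimal separator (such as "1.299,00 €") would need locale-specific handling before this step.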

Handling Anti-Bot Protection in Python

Many commercial sites deploy bot detection that blocks requests from Playwright, Selenium, or standard HTTP clients. SnapAPI's stealth mode bypasses these systems transparently. Enable it by adding "stealth": true to your request payload:

response = requests.post(
    f"{BASE_URL}/extract",
    json={
        "url": url,
        "selector": ".price",
        "stealth": True,
        "wait_for": ".price"
    },
    headers={"X-Api-Key": SNAPAPI_KEY},
    timeout=30
)

Stealth mode selects a browser fingerprint optimized for the target domain, handles CAPTCHA avoidance at the infrastructure level, and rotates residential proxies when needed. Updates are deployed server-side — your Python code requires no changes as anti-bot vendors evolve their detection.

Get started at snapapi.pics — 200 free extractions per month, no credit card required. The Python SDK at github.com/Sleywill/snapapi-python wraps all endpoints with type hints and async support. Install with pip install snapapi-python.

Scheduling Python Scraping Jobs

Most scraping pipelines run on a schedule — daily competitor price checks, hourly inventory updates, or weekly content audits. Use APScheduler for in-process scheduling or Celery Beat for distributed job scheduling:

from apscheduler.schedulers.blocking import BlockingScheduler
import asyncio
import os
import psycopg2

def monitor_prices():
    # Load the active product URLs, scrape them, and append rows to price_history
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        cur = conn.cursor()
        cur.execute("SELECT url FROM products WHERE active = true")
        urls = [row[0] for row in cur.fetchall()]
        results = asyncio.run(scrape_all(urls, ".product-price", concurrency=5))
        for r in results:
            cur.execute(
                "INSERT INTO price_history (url, price, captured_at) VALUES (%s, %s, NOW())",
                (r["url"], r["text"])
            )
        conn.commit()
        print(f"Captured {len(results)} prices")
    finally:
        # Release the connection even if scraping or an insert fails
        conn.close()

scheduler = BlockingScheduler()
# Cron trigger: run once a day at 06:00
scheduler.add_job(monitor_prices, "cron", hour=6, minute=0)
scheduler.start()

This pattern runs the price monitoring job every day at 6 AM. APScheduler handles timezone awareness, missed job recovery, and thread isolation. For production deployments, run the scheduler as a dedicated process separate from your web application, using supervisor or systemd for process management.

Error Handling and Logging for Python Scrapers

Configure Python's logging module to record every scrape attempt with URL, status code, and extracted value. Log failures to a separate error log and alert on high failure rates using a monitoring service like Sentry or Datadog. A failure rate above 10% on a specific domain typically indicates the site has changed its HTML structure and your selectors need updating:

import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

def safe_extract(url: str, selector: str) -> str | None:
    try:
        text = extract(url, selector)
        logger.info(f"OK {url}: {text[:50]}")
        return text
    except requests.HTTPError as e:
        logger.error(f"HTTP {e.response.status_code} for {url}")
    except Exception as e:
        logger.error(f"Error scraping {url}: {e}")
    return None

Return None on failure and filter it out downstream rather than raising — this keeps the batch job running even when individual URLs fail. Aggregate failure counts per domain after each run to detect when a site layout change breaks your selectors.
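
Per-domain aggregation can be sketched with the standard library alone. The example assumes each result is a `(url, text)` pair where `text` is None on failure, matching what a `safe_extract`-style wrapper returns; `failure_rates` and `domains_over_threshold` are hypothetical helper names, and the 10% default mirrors the threshold mentioned earlier.

```python
from collections import Counter
from urllib.parse import urlparse

def failure_rates(results):
    """Map each domain to its share of failed extractions.

    `results` is an iterable of (url, text) pairs; text is None on failure.
    """
    total = Counter()
    failed = Counter()
    for url, text in results:
        domain = urlparse(url).netloc
        total[domain] += 1
        if text is None:
            failed[domain] += 1
    return {domain: failed[domain] / total[domain] for domain in total}

def domains_over_threshold(results, threshold=0.10):
    """Return the domains whose failure rate exceeds `threshold`, sorted by name."""
    rates = failure_rates(results)
    return sorted(d for d, rate in rates.items() if rate > threshold)

if __name__ == "__main__":
    sample = [
        ("https://a.example.com/1", "$10.00"),
        ("https://a.example.com/2", None),
        ("https://b.example.com/1", "$5.00"),
    ]
    print(failure_rates(sample))
    print(domains_over_threshold(sample))
```

A domain that appears in the second list after a run is a reasonable candidate for a selector review.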

Python Web Scraping vs Playwright: When to Use Each

Playwright is the right choice when you need to control the browser interactively — filling forms, clicking buttons, intercepting network requests, or performing actions that require a persistent browser session. For data extraction where you pass a URL and expect structured output, SnapAPI is faster to integrate, cheaper to run at scale, and requires no browser binaries in your deployment environment.

The practical difference: a Playwright scraper running on a 2 vCPU, 4 GB RAM server can handle roughly 3 to 5 concurrent browser sessions before hitting memory limits. The same server running Python with SnapAPI can fire hundreds of concurrent HTTP requests without any additional resource constraint. For teams with scraping workloads that scale beyond a single server, SnapAPI eliminates the need to provision and manage a browser fleet.

Pricing for Python Scraping Workloads

SnapAPI's free tier provides 200 extractions per month, enough to prototype and test your Python scraper. The Starter plan at $19 per month provides 5,000 captures, covering daily monitoring of 150 URLs. The Pro plan at $79 per month provides 50,000 captures, supporting daily monitoring of about 1,600 URLs or hourly monitoring of about 65 URLs, with room for re-attempts. Business at $299 per month provides 500,000 captures for large-scale scraping operations.
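
Plan capacity comes down to simple arithmetic: URLs times runs per day times days in the month. The sketch below assumes a 30-day month and one capture per URL per run, with no retries; the function names are illustrative, and the quotas are the ones quoted above.

```python
def monthly_captures(urls: int, runs_per_day: int, days: int = 30) -> int:
    """Captures consumed per month when every URL is fetched on every run."""
    return urls * runs_per_day * days

def max_urls(quota: int, runs_per_day: int, days: int = 30) -> int:
    """Largest number of URLs that fits in a monthly quota, ignoring retries."""
    return quota // (runs_per_day * days)

if __name__ == "__main__":
    print(monthly_captures(150, runs_per_day=1))   # Starter example: 150 URLs daily
    print(max_urls(5_000, runs_per_day=1))         # Starter, daily checks
    print(max_urls(50_000, runs_per_day=1))        # Pro, daily checks
    print(max_urls(50_000, runs_per_day=24))       # Pro, hourly checks
    print(max_urls(500_000, runs_per_day=24))      # Business, hourly checks
```

Under these assumptions, 150 URLs checked daily consume 4,500 captures per month, and a 50,000-capture quota covers daily checks of up to 1,666 URLs or hourly checks of up to 69. Leaving headroom below those ceilings makes room for re-attempts.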

The full API reference, with Python code examples for every endpoint, is at snapapi.pics/docs.html.

Python Scraping for Machine Learning and AI Pipelines

Python is the dominant language for machine learning and AI data pipelines. Web scraping with SnapAPI feeds clean, structured text and HTML into these pipelines without the overhead of running browsers alongside your training or inference infrastructure.

Common patterns include scraping news articles and blog posts to build text corpora, extracting product descriptions and reviews for fine-tuning e-commerce recommendation models, monitoring competitor content for market intelligence dashboards, and capturing screenshots of web pages as visual training data for multimodal models.

SnapAPI's AI analysis endpoint goes further: pass a URL and a question, and the API returns a structured answer generated by an LLM analyzing the live page content. This is useful for rapid content classification and summarization tasks that do not require training a custom model.
