
Web Scraping with Python in 2025: requests, BeautifulSoup, and APIs

From basic HTML parsing to JavaScript-rendered SPAs — a practical guide to choosing the right Python scraping approach for every situation.

April 2025 · 10 min read

The Python Web Scraping Stack

Python has one of the richest ecosystems for web scraping. The combination of requests for HTTP, BeautifulSoup for HTML parsing, and lxml for XPath covers the majority of static scraping jobs. For JavaScript-rendered content, Playwright or Selenium launches a real browser, executes the page's JavaScript, and exposes the final DOM for parsing. Knowing when to use each approach, and when to skip the stack entirely in favor of a managed API, is the key skill for building reliable scrapers.

Approach 1: requests + BeautifulSoup

For static HTML pages with no JavaScript rendering, this combination is fast, simple, and requires minimal resources. Install with pip install requests beautifulsoup4 lxml and you are ready to parse any static page.

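import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://example.com/products",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
    timeout=10,  # requests has no default timeout; a stalled server would hang the script
)
resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(resp.text, "lxml")

products = []
for card in soup.select(".product-card"):
    title = card.select_one("h2")
    price = card.select_one(".price")
    if title is None or price is None:
        continue  # skip cards missing a field rather than crash on None
    products.append({
        "title": title.get_text(strip=True),
        "price": price.get_text(strip=True),
    })

print(f"Found {len(products)} products")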

This approach works well for news sites, blogs, Wikipedia, government data portals, and any site that serves complete HTML from the server. It fails on React, Vue, and Angular SPAs where the server sends an empty shell and JavaScript populates the content.

Approach 2: Playwright for JavaScript Sites

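Playwright launches a real Chromium browser and runs the page's JavaScript. By default page.goto returns once the page's load event fires, which can be before the content you want has rendered, so wait for a selector that only appears after rendering before you read the DOM. Install with pip install playwright followed by playwright install chromium.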

from playwright.sync_api import sync_playwright

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://spa-example.com/products")
    page.wait_for_selector(".product-card")
    cards = page.query_selector_all(".product-card")
    for card in cards:
        title = card.query_selector("h2").inner_text()
        price = card.query_selector(".price").inner_text()
        print(title, price)
    browser.close()

Playwright is powerful but resource-heavy. Each browser instance typically consumes roughly 150-300 MB of RAM, and heavy pages can push that higher. Running concurrent scrapers requires a hard cap on the number of simultaneous browsers to avoid memory exhaustion. On cloud infrastructure, Chromium also needs its system libraries installed, which is tedious on minimal Linux images; playwright install --with-deps chromium installs them on supported distributions.
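
One common way to keep memory bounded is to cap how many renders run at once. The sketch below uses asyncio.Semaphore for that. The render_page coroutine is a hypothetical placeholder that stands in for the real Playwright work, and MAX_BROWSERS is an assumed figure that you would size to the RAM available.

```python
import asyncio

MAX_BROWSERS = 3  # assumed cap; size it to the RAM you have available


async def render_page(url: str) -> str:
    """Hypothetical placeholder for the real Playwright work on one page."""
    await asyncio.sleep(0.01)  # simulates the time a render takes
    return f"<html><!-- rendered {url} --></html>"


async def render_all(urls: list[str], limit: int = MAX_BROWSERS) -> list[str]:
    """Render every URL while never running more than `limit` renders at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> str:
        async with semaphore:  # waits here while `limit` renders are in flight
            return await render_page(url)

    # gather preserves the order of the input URLs in its result list
    return await asyncio.gather(*(bounded(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/p/{i}" for i in range(10)]
    pages = asyncio.run(render_all(urls))
    print(f"Rendered {len(pages)} pages")
```

With the semaphore in place, peak memory scales with the cap rather than with the length of the URL list.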

Approach 3: Managed Scraping API

When you need JavaScript rendering but do not want to manage browsers on your servers, a managed scraping API like SnapAPI handles the browser infrastructure remotely. Your Python code stays simple — just an HTTP request — while SnapAPI runs Chromium, executes JavaScript, applies stealth mode, and returns the rendered HTML or extracted data.

import requests
from bs4 import BeautifulSoup

# Get fully rendered HTML from a JavaScript SPA
resp = requests.get(
    "https://api.snapapi.pics/v1/scrape",
    params={"url": "https://spa-example.com/products", "wait_for": ".product-card"},
    headers={"X-Api-Key": "YOUR_API_KEY"},
    timeout=60,  # remote rendering is slower than a plain fetch; adjust as needed
)
resp.raise_for_status()  # surface auth, quota, and server errors early
html = resp.json()["html"]

# Now parse with BeautifulSoup as usual
soup = BeautifulSoup(html, "lxml")
products = soup.select(".product-card")

This hybrid approach combines the simplicity of requests with JavaScript rendering capability. No browser installation, no memory management, no proxy rotation code. SnapAPI handles bot detection evasion, user agent rotation, and residential proxy routing when enabled.

When to Choose Each Approach

Use requests + BeautifulSoup for static sites, RSS feeds, Wikipedia, and government data — it is fast and free. Use Playwright when you need full browser control for interactive multi-step workflows like form filling and authentication. Use a managed API like SnapAPI when you need JavaScript rendering at scale without managing server infrastructure.

Sign up at snapapi.pics for 200 free scraping requests per month to test the managed API approach against your target sites. If it covers your use case, you will never need to install Chromium on a production server again.

Handling Anti-Bot Measures in Python

Modern websites increasingly use Cloudflare, DataDome, or custom anti-bot systems that block requests from scrapers. These systems detect headless browsers by checking for exposed JavaScript properties like navigator.webdriver, unusual plugin arrays, WebGL fingerprints, and canvas rendering signatures that differ between real and headless Chromium. Even with Playwright, you need to apply stealth patches to avoid detection on protected sites.

In Python, playwright-stealth applies the most common patches automatically. For residential proxy rotation to get past IP-based blocks, you need a proxy provider like Bright Data, Oxylabs, or Smartproxy. Each of these adds cost and complexity — more moving parts to configure, monitor, and pay for separately.

The managed API approach bundles stealth and proxy rotation into the service. SnapAPI stealth mode applies fingerprint patches automatically, and the proxy_country parameter enables residential routing without a separate proxy subscription. For most scraping targets, this is sufficient to retrieve data that would otherwise block a self-hosted Playwright instance.

Scaling Python Scraping with Celery and Redis

For production scraping workloads, Celery with a Redis broker is the standard Python queue setup. Define a task that accepts a URL, calls the SnapAPI scrape or extract endpoint, and writes the result to your database. The Celery beat scheduler triggers URL refresh cycles on your desired cadence — hourly, daily, or weekly.

from celery import Celery
import requests

app = Celery("scraper", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def scrape_url(self, url: str):
    try:
        resp = requests.get(
            "https://api.snapapi.pics/v1/scrape",
            params={"url": url, "stealth": "true"},
            headers={"X-Api-Key": "YOUR_API_KEY"},
            timeout=30
        )
        resp.raise_for_status()
        return resp.json()["html"]
    except requests.RequestException as exc:
        # Covers HTTP error statuses as well as timeouts and connection failures
        raise self.retry(exc=exc)

This Celery task retries failed requests up to three times, 60 seconds apart. Celery records each task's state when a result backend is configured (the example above sets only a broker), and Flower provides real-time monitoring of your scraping pipeline. Combined with SnapAPI, this gives you a scalable Python scraping system without managing any browser infrastructure.
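
The refresh cadence mentioned earlier can be expressed with Celery's beat_schedule setting. The fragment below is a sketch under assumptions: it supposes the task is defined in a module named tasks.py (so its registered name is tasks.scrape_url), and the entry names and target URLs are placeholders.

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("scraper", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    # Entry names are arbitrary labels; these ones are placeholders.
    "refresh-products-hourly": {
        "task": "tasks.scrape_url",      # assumes the task lives in tasks.py
        "schedule": crontab(minute=0),   # at minute 0 of every hour
        "args": ("https://example.com/products",),
    },
    "refresh-catalog-daily": {
        "task": "tasks.scrape_url",
        "schedule": crontab(minute=0, hour=3),  # every day at 03:00
        "args": ("https://example.com/catalog",),
    },
}
```

Under the same module-name assumption, the scheduler runs as its own process (celery -A tasks beat) alongside at least one worker (celery -A tasks worker).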

Python Scraping Best Practices

Regardless of which approach you use, several best practices apply to all Python web scraping projects. First: always respect robots.txt and the target site's terms of service. Scraping public data for research and competitive intelligence is generally acceptable. Bypassing paywalls, scraping personal data, or overloading servers with excessive request rates is not.
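
The robots.txt check can be done with the standard library's urllib.robotparser. The sketch below parses an inline sample so it runs offline; the rules in SAMPLE_ROBOTS and the bot name MyBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyBot"  # hypothetical bot name

# Made-up rules for illustration. Against a live site you would call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

for url in ("https://example.com/products", "https://example.com/private/report"):
    verdict = "allowed" if parser.can_fetch(USER_AGENT, url) else "disallowed"
    print(url, verdict)

# crawl_delay returns None when the file does not declare one
print("Requested crawl delay:", parser.crawl_delay(USER_AGENT))
```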

Second: set realistic crawl delays. Hammering a site with 100 requests per second will trigger rate limits and IP bans, and it is rude. Most well-behaved scrapers add a 1-2 second delay between requests per domain. With SnapAPI, your application calls the SnapAPI server, which makes the actual browser request, so your own throughput is capped by your SnapAPI quota. The target site still receives every one of those requests, so keep a per-domain delay in place there too.
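
A per-domain delay can be enforced with a small helper that remembers when each host was last contacted. This is a minimal sketch: the DomainThrottle class is illustrative rather than part of any library, and the 1.5 second default is an assumption within the range mentioned above.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, min_delay: float = 1.5, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_request: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the URL's host may be contacted again; return seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        pause = 0.0
        if last is not None:
            pause = max(0.0, self.min_delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        # Record the moment the request is actually allowed to go out
        self._last_request[host] = now + pause
        return pause
```

Call throttle.wait(url) immediately before each request. Different hosts are tracked independently, so a crawl that spans several sites is slowed only by the per-site gap.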

Third: handle pagination carefully. Extract the next-page link or cursor from each result and feed it back into the next request. Use a visited URL set to avoid re-crawling pages you have already processed. For large crawls, persist the queue and progress in a database (Redis sorted sets or a PostgreSQL queue table work well) so the job can resume after an interruption.
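
The pagination loop can be sketched independently of how pages are fetched. In the example below, fetch_page is a caller-supplied function (an assumption of this sketch) that returns the items on a page plus the next-page URL, or None on the last page. The max_pages cap is an added safety limit.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns (items_on_page, next_page_url_or_None).
Fetcher = Callable[[str], tuple[list[str], Optional[str]]]


def crawl_pages(start_url: str, fetch_page: Fetcher, max_pages: int = 1000) -> list[str]:
    """Follow next-page links from start_url, visiting each URL at most once."""
    visited: set[str] = set()
    items: list[str] = []
    url: Optional[str] = start_url
    # Stop on the last page, on a link back to a visited page, or at the cap
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items


if __name__ == "__main__":
    # Fake three-page site whose last page links back to the first
    fake_site = {
        "/p1": (["a", "b"], "/p2"),
        "/p2": (["c"], "/p3"),
        "/p3": (["d"], "/p1"),
    }
    print(crawl_pages("/p1", lambda u: fake_site[u]))
```

For a resumable crawl, the in-memory visited set would be replaced by the persistent store described above.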

Storing and Processing Scraped Data

Raw scraped HTML needs processing before it is useful. BeautifulSoup or lxml extract the data fields you care about from the HTML returned by the scrape endpoint. Pandas normalizes and cleans the extracted fields. SQLAlchemy or psycopg2 persists records to PostgreSQL with upsert logic so re-runs do not duplicate data. For large-scale analytics, write directly to BigQuery or a data lake in Parquet format.
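
The upsert step can be sketched with the standard library's sqlite3 module, used here only as a stand-in so the example runs without a database server. PostgreSQL accepts a similar INSERT ... ON CONFLICT ... DO UPDATE statement, though psycopg2 uses %(name)s placeholders instead of :name. The products table and its columns are assumptions for illustration.

```python
import sqlite3


def save_products(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Insert new products and update existing ones, keyed on URL."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS products (
            url   TEXT PRIMARY KEY,
            title TEXT NOT NULL,
            price TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        """
        INSERT INTO products (url, title, price)
        VALUES (:url, :title, :price)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price
        """,
        rows,
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    save_products(conn, [{"url": "https://example.com/p/1", "title": "Widget", "price": "9.99"}])
    # A re-run with a new price updates the row instead of duplicating it
    save_products(conn, [{"url": "https://example.com/p/1", "title": "Widget", "price": "8.49"}])
    print(conn.execute("SELECT url, title, price FROM products").fetchall())
```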

Change detection is a common requirement for monitoring pipelines: you want to know when a competitor changes a price or a page layout shifts. Hash the extracted data on each run and compare it to the previous hash stored in your database. When the hash changes, trigger an alert or downstream workflow. Because the hash covers the extracted fields rather than the raw HTML, this simple pattern avoids false positives from minor markup or whitespace changes.
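
The hashing step might look like the following sketch, which assumes each extracted record is a flat dict of simple values. Serializing with sorted keys and collapsing whitespace in the values keeps the hash stable when only field order or spacing differs. The function names are illustrative.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable SHA-256 hex digest for a flat record of extracted fields."""
    # Collapse runs of whitespace so spacing differences do not change the hash
    normalized = {key: " ".join(str(value).split()) for key, value in record.items()}
    # Sorted keys and fixed separators give the same JSON text for the same data
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(record: dict, previous_hash: str) -> bool:
    """Compare a freshly extracted record against the hash stored from the last run."""
    return fingerprint(record) != previous_hash


if __name__ == "__main__":
    stored = fingerprint({"title": "Widget", "price": "9.99"})
    latest = {"price": "8.49", "title": "Widget"}
    if has_changed(latest, stored):
        print("Change detected, trigger the alert")
```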

Start your Python scraping project at snapapi.pics with 200 free requests per month. The scrape, extract, and screenshot endpoints are all available on the free tier. Combine with BeautifulSoup, Pandas, and your preferred database to build a complete data collection pipeline in an afternoon.

Python Scraping Quick Reference

Static HTML sites: use requests plus BeautifulSoup. It is fast, lightweight, and has zero browser overhead.

JavaScript SPAs when you run your own browser infrastructure: use Playwright with playwright-stealth for full browser control.

JavaScript SPAs at scale without browser management overhead: use the SnapAPI scrape endpoint from Python via requests.

Structured data extraction without HTML parsing: use the SnapAPI extract endpoint with CSS selectors.

Bot-protected sites: use SnapAPI with stealth mode and the proxy_country parameter.

Batch pipelines: Celery plus Redis plus SnapAPI is the reliable production stack.

Sign up at snapapi.pics for 200 free requests to test your chosen approach before committing to infrastructure.
