Guides · April 5, 2026

Web Scraping Without Getting Blocked: Headers, Proxies, and Stealth Mode

Why scraping requests get blocked, and practical techniques to avoid it — from correct request headers and rate limiting to proxy rotation and browser fingerprint stealth.

Why Scrapers Get Blocked

Websites detect and block scrapers using several layers of signals. Understanding each layer helps you address them systematically rather than guessing why requests fail.

IP reputation and rate detection are the most common block mechanisms. Sites track request frequency per IP address, and a single IP making hundreds of requests per hour to the same domain will trip rate limits on many sites. Residential IPs are treated far more leniently than datacenter IPs: AWS, Google Cloud, and DigitalOcean IP ranges are often pre-blocked or heavily throttled by sites that have learned to recognize them.
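To make the per-IP tracking concrete, here is a minimal sketch of the kind of sliding-window counter a site might run on its side. The class name, the 300-requests-per-hour limit, and the window length are illustrative assumptions, not the behavior of any particular product.

```python
from collections import defaultdict, deque


class PerIpRateMonitor:
    """Toy server-side check: flag an IP that exceeds `limit` requests
    within the last `window` seconds. All thresholds are hypothetical."""

    def __init__(self, limit: int = 300, window: float = 3600.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, now: float) -> bool:
        """Record one request at time `now`; return True if the IP is over the limit."""
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        return len(hits) > self.limit
```

Because the counter is keyed by IP, spreading the same traffic across several addresses or slowing it down are the only two ways to stay under the threshold, which is what the rate limiting and proxy layers in this guide address.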

HTTP header fingerprinting identifies non-browser requests. A real browser sends a distinctive set of headers: Accept, Accept-Language, Accept-Encoding, Connection, Sec-Fetch-*, and others that vary by browser version. A plain Python requests call sends a minimal header set that bot detection systems recognize immediately.

TLS fingerprinting examines the TLS handshake parameters — cipher suites, extensions, and their ordering — before any HTTP headers are evaluated. Python's requests library and Node.js https module produce TLS fingerprints that differ from Chromium's, allowing sites using services like Cloudflare to identify scrapers at the network layer before looking at any HTTP content.
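One widely known way to turn handshake parameters into a comparable fingerprint is the JA3 scheme, which joins the numeric ClientHello fields into a string and hashes it with MD5. The sketch below follows that scheme as an illustration only; the source does not say which method any given vendor uses, and the numeric values in the usage note are made-up inputs rather than a real client's handshake.

```python
import hashlib


def ja3_style_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style fingerprint: five comma-separated fields, with the
    values inside each field joined by dashes, then MD5-hashed."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()
```

Calling `ja3_style_hash(771, [4865, 4866], [0, 23], [29, 23], [0])` and then repeating the call with the cipher list reversed yields two different hashes. Ordering alone changes the fingerprint, which is why an HTTP client with a different TLS stack stands out even when its headers are perfect.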

Browser automation detection applies to headless browser scrapers. Sites run JavaScript checks for signals like navigator.webdriver === true, the presence of Playwright's __playwright global, Chrome's window.chrome.runtime being undefined in headless mode, and dozens of other automation artifacts that differ from a real browser session.

Layer 1: Request Headers

Set a realistic browser User-Agent and the supporting headers that real browsers send alongside it. An accurate Chrome User-Agent with missing Sec-Fetch-* headers is still a suspicious combination — bot detection systems look at the full header set:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/123.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
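    # 'br' is only decoded when the brotli (or brotlicffi) package is installed;
    # drop it otherwise, or Brotli-compressed responses will come back undecoded.
    'Accept-Encoding': 'gzip, deflate, br',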
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
}

session = requests.Session()
session.headers.update(headers)
response = session.get('https://example.com')

Use a requests.Session() to maintain cookies and headers across requests to the same domain, mimicking a real browser session that navigates through multiple pages. Keep the User-Agent string updated — outdated Chrome versions can trigger suspicion on sites that track major release versions.
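Since detection looks at the header set as a whole, it can help to check a header dictionary for internal consistency before sending it. The helper below is a rough sketch: the function name and the list of expected headers are assumptions drawn from the example above, not an authoritative list of what any detection vendor checks.

```python
# Headers that the example above sends alongside a Chrome User-Agent.
EXPECTED_WITH_CHROME_UA = (
    "Accept",
    "Accept-Language",
    "Accept-Encoding",
    "Sec-Fetch-Dest",
    "Sec-Fetch-Mode",
    "Sec-Fetch-Site",
)


def missing_browser_headers(headers: dict[str, str]) -> list[str]:
    """Return the headers a Chrome-like User-Agent implies but `headers` lacks.

    Header names are compared case-insensitively. A non-Chrome User-Agent
    returns an empty list because this sketch only models Chrome."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "Chrome/" not in lowered.get("user-agent", ""):
        return []
    return [name for name in EXPECTED_WITH_CHROME_UA if name.lower() not in lowered]
```

Running it on a dictionary that contains only a Chrome User-Agent returns all six names, which is exactly the suspicious combination described above.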

Layer 2: Rate Limiting and Human-Like Timing

Space requests with random delays to avoid the perfectly uniform inter-request timing that bot detection systems flag. A real user clicking through pages introduces variable delays:

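import random
import time

import requests

def scrape_with_delay(urls: list[str], min_delay=1.5, max_delay=4.0):
    """Fetch each URL with the shared `session` from Layer 1, pausing randomly in between."""
    results = []
    for i, url in enumerate(urls):
        try:
            response = session.get(url, timeout=15)
            # Treat 4xx/5xx responses (for example 403 or 429 block pages) as errors
            response.raise_for_status()
            results.append({'url': url, 'html': response.text})
        except requests.RequestException as e:
            results.append({'url': url, 'error': str(e)})
        # Random delay between requests; no need to wait after the last URL
        if i < len(urls) - 1:
            time.sleep(random.uniform(min_delay, max_delay))
    return results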

A 1.5 to 4 second random delay keeps request rates low enough to avoid most per-IP rate limits while still processing a meaningful number of URLs per hour. For high-value targets with aggressive rate limiting, increase the range to 5 to 15 seconds and reduce concurrency to a single thread.
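The throughput implied by a delay range is simple arithmetic: the mean of a uniform delay is the midpoint of the range, and one thread completes one request per mean delay plus response time. The function below is a back-of-the-envelope sketch; its name and the optional `avg_response_time` parameter are illustrative.

```python
def requests_per_hour(min_delay: float, max_delay: float,
                      avg_response_time: float = 0.0) -> float:
    """Upper-bound throughput of a single thread that sleeps a uniform
    random delay between requests."""
    mean_delay = (min_delay + max_delay) / 2  # mean of a uniform distribution
    return 3600 / (mean_delay + avg_response_time)


print(round(requests_per_hour(1.5, 4.0)))   # 1309
print(round(requests_per_hour(5.0, 15.0)))  # 360
```

The shorter range still allows over a thousand requests per hour from one IP, which is why the longer 5 to 15 second range is the safer choice against aggressive limits.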

Layer 3: Proxy Rotation

Residential proxy networks rotate your outgoing IP address across a pool of real consumer IPs, making each request appear to come from a different household. This addresses IP-level blocks and rate limits that target specific addresses. Commercial residential proxy providers include Bright Data, Oxylabs, and Smartproxy.

import random

PROXY_LIST = [
    "http://user:pass@proxy1.provider.com:8080",
    "http://user:pass@proxy2.provider.com:8080",
    "http://user:pass@proxy3.provider.com:8080",
]

def get_proxy():
    proxy = random.choice(PROXY_LIST)
    return {'http': proxy, 'https': proxy}

# Reuses the `session` from Layer 1; `url` is the page being fetched.
response = session.get(url, proxies=get_proxy(), timeout=15)

Residential proxies add cost — typically $5 to $15 per GB of traffic. For high-volume scraping where individual IP blocks are a persistent problem, this cost is usually justified. For moderate volumes, the header and rate limiting techniques above are often sufficient.
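To judge whether the cost is justified, a quick estimate from page count and average page size is usually enough. The sketch below assumes decimal gigabytes and a hypothetical average page weight of 500 KB; substitute measured values for your targets.

```python
def proxy_cost_usd(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Estimate proxy traffic cost, using decimal units (1 GB = 1,000,000 KB)."""
    gigabytes = pages * avg_page_kb / 1_000_000
    return gigabytes * price_per_gb


# 100,000 pages at an assumed 500 KB each is 50 GB of traffic.
print(proxy_cost_usd(100_000, 500, 5.0))   # 250.0
print(proxy_cost_usd(100_000, 500, 15.0))  # 750.0
```

Full browser rendering downloads images, scripts, and fonts as well, so traffic per page can be considerably higher than for a plain HTML fetch.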

Layer 4: Browser Fingerprint Stealth via API

For sites that require full browser fingerprint stealth — Cloudflare-protected pages, JavaScript challenge sites, and platforms with aggressive DataDome or PerimeterX deployments — delegating rendering to a managed stealth browser API eliminates the need to maintain fingerprint patches yourself:

import os

import requests

def stealth_extract(url: str, selector: str) -> str:
    response = requests.post(
        "https://api.snapapi.pics/v1/extract",
        json={"url": url, "selector": selector, "stealth": True, "wait_for": selector},
        headers={"X-Api-Key": os.environ["SNAPAPI_KEY"]},
        timeout=30
    )
    response.raise_for_status()
    return response.json()["text"]

price = stealth_extract("https://protected-shop.example.com/product/123", ".price")
print(price)

SnapAPI's stealth infrastructure handles browser fingerprint randomization, TLS fingerprint matching, proxy selection, and anti-bot bypass at the infrastructure level. Your Python code stays simple — no playwright-extra plugins to install, no fingerprint patches to maintain as detection vendors update their systems.

Start scraping without getting blocked at snapapi.pics — 200 free captures per month, no credit card required. The extract and scrape endpoints with stealth mode enabled are available on all paid plans.

Browser Fingerprinting and How to Avoid Detection

Modern anti-bot systems go far beyond checking the User-Agent string. They fingerprint the browser environment: canvas rendering output, WebGL renderer strings, AudioContext node counts, installed fonts, screen resolution relative to reported viewport, and the presence of automation-specific navigator properties like navigator.webdriver. A scraper that spoofs only the User-Agent while leaving these signals intact is trivially detected.
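A toy detector shows why spoofing the User-Agent alone is not enough. Everything here is a simplified assumption for illustration: the dictionary keys are hypothetical, real systems evaluate far more signals, and the "SwiftShader" check assumes a headless browser that falls back to a software WebGL renderer.

```python
def automation_flags(env: dict) -> list[str]:
    """Return the reasons a toy detector would distrust a browser environment.

    `env` is a hypothetical snapshot of values collected by in-page JavaScript."""
    flags = []
    if env.get("webdriver") is True:
        flags.append("navigator.webdriver is true")
    if "SwiftShader" in env.get("webgl_renderer", ""):
        flags.append("software WebGL renderer")
    screen_w, screen_h = env.get("screen", (0, 0))
    view_w, view_h = env.get("viewport", (0, 0))
    # A viewport cannot be larger than the physical screen it is drawn on.
    if view_w > screen_w or view_h > screen_h:
        flags.append("viewport larger than screen")
    return flags
```

Note that the function never reads the User-Agent. An environment with a flawless Chrome User-Agent but `webdriver` set to true is flagged all the same.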

Stealth patching libraries like playwright-extra with the stealth plugin suppress most of these signals at the JavaScript level. They override navigator.webdriver, randomize canvas noise, and inject realistic plugin arrays. However, stealth patches require constant maintenance as anti-bot vendors update their detection heuristics — what bypasses detection today may fail next month.

A more durable approach is to use a managed browser API that maintains its own stealth layer and updates it continuously. SnapAPI's stealth mode handles fingerprint suppression, real browser rendering, and JavaScript execution without requiring your codebase to track anti-bot patches. When a new detection vector emerges, the API layer absorbs the update transparently.

Request Pacing and Crawl Politeness

Even with perfect fingerprint masking, scraping too fast from a single IP triggers volume-based blocks. Sites monitor request frequency per session, per IP, and increasingly per browser fingerprint cluster. Crawl politeness, meaning a reasonable delay between requests, reduces detection risk and is consistent with the crawl-rate expectations a site may express in robots.txt.

Practical pacing guidelines: introduce 1–3 second random delays between page requests, avoid more than 10 concurrent requests to the same domain, and randomize request order rather than following alphabetical or sequential URL patterns. Sequential patterns are a strong signal of automation.
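These guidelines can be combined in a small crawl loop. The following is a minimal sketch, not a production crawler: `fetch` is a caller-supplied function (a hypothetical stand-in for whatever performs the request), and the delay is applied per worker before each request.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def paced_crawl(urls: Iterable[str], fetch: Callable[[str], str],
                max_workers: int = 10,
                delay_range: tuple[float, float] = (1.0, 3.0),
                seed=None) -> dict[str, str]:
    """Fetch URLs in shuffled order with capped concurrency and random delays."""
    rng = random.Random(seed)
    order = list(urls)
    rng.shuffle(order)  # avoid sequential or alphabetical request patterns
    # Draw the delays up front so worker threads never share the RNG.
    delays = [rng.uniform(*delay_range) for _ in order]

    def worker(item):
        url, delay = item
        time.sleep(delay)  # pause before each request
        return url, fetch(url)

    # max_workers caps the number of simultaneous requests to the domain.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(worker, zip(order, delays)))
```

With ten workers each waiting one to three seconds, the aggregate rate is still several requests per second, so lower `max_workers` for sites with strict limits.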

For large-scale crawls, distribute requests across multiple proxy IPs and rotate them on a per-domain or per-session basis rather than per-request. Frequent per-request rotation is itself a detectable pattern — IPs that appear for a single request and vanish are flagged as proxy pools by threat intelligence systems.
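Per-domain rotation can be sketched by deriving the proxy from the target hostname, so every request to one domain in a session leaves through the same IP. The function name and the `session_id` parameter are illustrative assumptions, and the proxy URLs in the usage note are placeholders.

```python
import hashlib
from urllib.parse import urlsplit


def proxy_for_url(url: str, proxies: list[str], session_id: str = "") -> str:
    """Pick a proxy that stays fixed for a given hostname and session.

    hashlib is used instead of the built-in hash() because string hashes
    are randomized per Python process and would not be stable across runs."""
    host = urlsplit(url).hostname or ""
    key = f"{session_id}:{host}".encode("utf-8")
    index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(proxies)
    return proxies[index]
```

Given a pool such as `["http://proxy1.example:8080", "http://proxy2.example:8080"]`, two different pages on the same host always map to the same entry. Changing `session_id` starts a new mapping, which allows rotation between crawl sessions rather than between requests.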

The Simplest Path: Managed Browser APIs

Every technique covered in this guide (stealth patching, proxy rotation, request pacing, CAPTCHA handling, header normalization) adds operational complexity to your scraping stack. Each layer requires maintenance, monitoring, and updates as detection systems evolve. For many engineering teams, the cumulative cost of owning that stack can exceed the cost of a managed API within a few months.

SnapAPI's scrape and extract endpoints handle all of these concerns behind a single HTTP call. Stealth mode, residential proxy routing, JavaScript rendering, and cookie consent dismissal are included. Sign up at snapapi.pics for 200 free requests per month and make your first unblocked scrape in under five minutes.

CAPTCHA Handling Strategies

CAPTCHAs are the last line of defense most sites deploy once behavioral and fingerprint checks have failed to stop a scraper. Three types are especially common: image-based reCAPTCHA v2; invisible reCAPTCHA v3, which scores sessions rather than presenting a challenge; and hCaptcha, which has been widely used on Cloudflare-protected domains. Solving challenge-based CAPTCHAs programmatically requires either a third-party solving service staffed by human solvers or an AI-based solver for image challenges. A score-based system such as reCAPTCHA v3 presents no puzzle to solve, so the only lever is a session that looks legitimate. SnapAPI handles cookie consent popups and soft bot gates automatically but, by design, does not solve hard CAPTCHAs: doing so on sites that explicitly gate access raises legal and terms-of-service questions that your legal team should review.

For sites protected by Cloudflare Turnstile or Imperva, the most reliable approach is to use residential proxy IP addresses that have not been flagged, combined with real browser rendering. Many CAPTCHA triggers are proxy-quality checks rather than genuine puzzle challenges, meaning a clean residential IP often bypasses the gate entirely without needing to solve anything.