Web Scraping Without Getting Blocked: Headers, Proxies, and Stealth Mode
Why scraping requests get blocked, and practical techniques to avoid it — from correct request headers and rate limiting to proxy rotation and browser fingerprint stealth.
Why Scrapers Get Blocked
Websites detect and block scrapers using several layers of signals. Understanding each layer helps you address them systematically rather than guessing why requests fail.
IP reputation and request-rate detection are the most common block mechanisms. Sites track request frequency per IP address, and a single IP making hundreds of requests per hour to the same domain will trip rate limits on most protected sites. Residential IPs are treated far more leniently than datacenter IPs: AWS, Google Cloud, and DigitalOcean IP ranges are often pre-blocked or heavily throttled by sites that have learned to recognize them.
HTTP header fingerprinting identifies non-browser requests. A real browser sends a distinctive set of headers: Accept, Accept-Language, Accept-Encoding, Connection, Sec-Fetch-*, and others that vary by browser version. A plain Python requests call sends a minimal header set that bot detection systems recognize immediately.
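To make the gap concrete, here is a minimal sketch that compares a header set against a list of headers a browser navigation request usually carries. The `BROWSER_HEADERS` list and the `missing_browser_headers` helper are illustrative assumptions, not any detection vendor's actual rules, and the `bare_client` dictionary only approximates what a default requests call sends (the version string varies).

```python
# Headers a Chrome navigation request typically carries.
# Illustrative list only; real detection systems use their own rule sets.
BROWSER_HEADERS = [
    "User-Agent", "Accept", "Accept-Language", "Accept-Encoding",
    "Sec-Fetch-Dest", "Sec-Fetch-Mode", "Sec-Fetch-Site",
    "Upgrade-Insecure-Requests",
]


def missing_browser_headers(headers: dict[str, str]) -> list[str]:
    """Return the expected browser headers absent from `headers` (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in BROWSER_HEADERS if name.lower() not in present]


# Approximately what a bare requests call sends by default (assumed values).
bare_client = {
    "User-Agent": "python-requests/2.31.0",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}

print(missing_browser_headers(bare_client))
# ['Accept-Language', 'Sec-Fetch-Dest', 'Sec-Fetch-Mode', 'Sec-Fetch-Site', 'Upgrade-Insecure-Requests']
```

Even before the User-Agent value is inspected, the absent headers already separate this client from a browser.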
TLS fingerprinting examines the TLS handshake parameters — cipher suites, extensions, and their ordering — before any HTTP headers are evaluated. Python's requests library and Node.js https module produce TLS fingerprints that differ from Chromium's, allowing sites using services like Cloudflare to identify scrapers at the network layer before looking at any HTTP content.
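One publicly documented scheme of this kind is JA3, which hashes the fields a client offers in its TLS ClientHello. The sketch below follows the JA3 string layout (five comma-separated fields, values joined by dashes, then MD5). The article does not say which method any particular site uses, and the numeric values are illustrative, not a capture from a real client.

```python
import hashlib


def ja3_style_hash(tls_version: int, ciphers: list[int], extensions: list[int],
                   curves: list[int], point_formats: list[int]) -> str:
    """Build a JA3-style string from ClientHello fields and return its MD5 hex digest."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Two hypothetical clients offering the same ciphers in a different order.
client_a = ja3_style_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = ja3_style_hash(771, [4867, 4866, 4865], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(client_a == client_b)  # False: ordering alone changes the fingerprint
```

Because the hash depends on ordering as well as content, changing HTTP headers has no effect on it; the TLS stack itself has to match.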
Browser automation detection applies to headless browser scrapers. Sites run JavaScript checks for signals like navigator.webdriver === true, the presence of Playwright's __playwright global, Chrome's window.chrome.runtime being undefined in headless mode, and dozens of other automation artifacts that differ from a real browser session.
Layer 1: Request Headers
Set a realistic browser User-Agent and the supporting headers that real browsers send alongside it. An accurate Chrome User-Agent with missing Sec-Fetch-* headers is still a suspicious combination — bot detection systems look at the full header set:
import requests

# Header set modeled on a Chrome navigation request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/123.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    # Advertising 'br' requires the brotli (or brotlicffi) package to be installed,
    # otherwise Brotli-compressed responses cannot be decoded
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
}

# The session reuses these headers and any cookies on every request
session = requests.Session()
session.headers.update(headers)
response = session.get('https://example.com', timeout=15)
Use a requests.Session() to maintain cookies and headers across requests to the same domain, mimicking a real browser session that navigates through multiple pages. Keep the User-Agent string updated — outdated Chrome versions can trigger suspicion on sites that track major release versions.
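A small check can flag a User-Agent that has fallen behind. This is a sketch under stated assumptions: `chrome_major_version` and `is_stale` are hypothetical helpers, the caller supplies the current Chrome major version, and the `max_lag` threshold of 4 releases is an arbitrary choice, not a documented detection rule.

```python
import re
from typing import Optional


def chrome_major_version(user_agent: str) -> Optional[int]:
    """Extract the Chrome major version from a User-Agent string, if present."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return int(match.group(1)) if match else None


def is_stale(user_agent: str, current_major: int, max_lag: int = 4) -> bool:
    """True if the User-Agent is not Chrome or trails current_major by more than max_lag."""
    major = chrome_major_version(user_agent)
    return major is None or current_major - major > max_lag


ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
      "AppleWebKit/537.36 (KHTML, like Gecko) "
      "Chrome/123.0.0.0 Safari/537.36")

print(chrome_major_version(ua))  # 123
```

Running a check like this at scraper startup turns a silent source of blocks into an explicit warning.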
Layer 2: Rate Limiting and Human-Like Timing
Space requests with random delays to avoid the perfectly uniform inter-request timing that bot detection systems flag. A real user clicking through pages introduces variable delays:
import random
import time

import requests

# Reuses the `session` created in the Layer 1 example
def scrape_with_delay(urls: list[str], min_delay=1.5, max_delay=4.0):
    results = []
    for url in urls:
        try:
            response = session.get(url, timeout=15)
            # Treat 4xx/5xx responses (including block pages) as errors
            response.raise_for_status()
            results.append({'url': url, 'html': response.text})
        except requests.RequestException as e:
            results.append({'url': url, 'error': str(e)})
        # Random delay between requests
        time.sleep(random.uniform(min_delay, max_delay))
    return results
A 1.5 to 4 second random delay keeps request rates low enough to avoid most per-IP rate limits while still processing a meaningful number of URLs per hour. For high-value targets with aggressive rate limiting, increase the range to 5 to 15 seconds and reduce concurrency to a single thread.
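The throughput these delay ranges allow can be estimated from the mean of the uniform delay. The helper below is a back-of-the-envelope sketch; `requests_per_hour` is a hypothetical name, and the estimate assumes one thread and a fixed average response time (zero by default, which gives an upper bound).

```python
def requests_per_hour(min_delay: float, max_delay: float,
                      avg_response_time: float = 0.0) -> float:
    """Estimate single-thread throughput for uniformly random delays."""
    mean_delay = (min_delay + max_delay) / 2  # mean of a uniform distribution
    return 3600 / (mean_delay + avg_response_time)


print(round(requests_per_hour(1.5, 4.0)))   # 1309
print(round(requests_per_hour(5.0, 15.0)))  # 360
```

Under these assumptions the default range still permits roughly 1,300 requests per hour from one IP, which is above the "hundreds of requests per hour" level described earlier as a common trigger. The 5 to 15 second range, at about 360 per hour, is the safer starting point for strict sites.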
Layer 3: Proxy Rotation
Residential proxy networks rotate your outgoing IP address across a pool of real consumer IPs, making each request appear to come from a different household. This addresses IP-level blocks and rate limits that target specific addresses. Commercial residential proxy providers include Bright Data, Oxylabs, and Smartproxy.
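import random

PROXY_LIST = [
    "http://user:pass@proxy1.provider.com:8080",
    "http://user:pass@proxy2.provider.com:8080",
    "http://user:pass@proxy3.provider.com:8080",
]

def get_proxy():
    # Pick one proxy at random and route both HTTP and HTTPS traffic through it
    proxy = random.choice(PROXY_LIST)
    return {'http': proxy, 'https': proxy}

# `session` is the requests.Session from the Layer 1 example
url = 'https://example.com'
response = session.get(url, proxies=get_proxy(), timeout=15)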
Residential proxies add cost — typically $5 to $15 per GB of traffic. For high-volume scraping where individual IP blocks are a persistent problem, this cost is usually justified. For moderate volumes, the header and rate limiting techniques above are often sufficient.
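Random selection keeps handing out a proxy even after the target has blocked it. One possible refinement, sketched here as an assumption and not as a feature of any provider, is a small pool that benches a failed proxy for a cooldown period. The `ProxyPool` class, its method names, and the 300-second default are all hypothetical.

```python
import random
import time


class ProxyPool:
    """Random proxy selection that skips proxies recently marked as failed."""

    def __init__(self, proxies: list[str], cooldown: float = 300.0):
        self._proxies = list(proxies)
        self._cooldown = cooldown
        self._benched_until: dict[str, float] = {}

    def get(self) -> dict[str, str]:
        """Return a requests-style proxies mapping for a proxy that is not cooling down."""
        now = time.monotonic()
        available = [
            p for p in self._proxies
            if p not in self._benched_until or self._benched_until[p] <= now
        ]
        if not available:
            raise RuntimeError("all proxies are cooling down")
        proxy = random.choice(available)
        return {"http": proxy, "https": proxy}

    def mark_failed(self, proxy_url: str) -> None:
        """Bench a proxy (for example after a 403 or 429) for the cooldown period."""
        self._benched_until[proxy_url] = time.monotonic() + self._cooldown


# Placeholder proxy URLs for illustration only.
pool = ProxyPool(["http://proxy-a.example:8080", "http://proxy-b.example:8080"])
pool.mark_failed("http://proxy-a.example:8080")
print(pool.get()["http"])  # http://proxy-b.example:8080
```

The mapping returned by `get()` has the same shape as `get_proxy()` above, so it can be passed directly as the `proxies` argument.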
Layer 4: Browser Fingerprint Stealth via API
For sites that require full browser fingerprint stealth — Cloudflare-protected pages, JavaScript challenge sites, and platforms with aggressive DataDome or PerimeterX deployments — delegating rendering to a managed stealth browser API eliminates the need to maintain fingerprint patches yourself:
import os

import requests

def stealth_extract(url: str, selector: str) -> str:
    # The API key is read from the SNAPAPI_KEY environment variable
    response = requests.post(
        "https://api.snapapi.pics/v1/extract",
        json={"url": url, "selector": selector, "stealth": True, "wait_for": selector},
        headers={"X-Api-Key": os.environ["SNAPAPI_KEY"]},
        timeout=30,
    )
    # Raise on 4xx/5xx so failures are not silently parsed as results
    response.raise_for_status()
    return response.json()["text"]

price = stealth_extract("https://protected-shop.example.com/product/123", ".price")
print(price)
SnapAPI's stealth infrastructure handles browser fingerprint randomization, TLS fingerprint matching, proxy selection, and anti-bot bypass at the infrastructure level. Your Python code stays simple — no playwright-extra plugins to install, no fingerprint patches to maintain as detection vendors update their systems.
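Since the layers differ in cost, it can help to escalate to a stealth API only when the cheaper layers appear to have been blocked. The heuristic below is a rough sketch, not part of SnapAPI or any detection vendor's behavior: the status codes and body markers are assumptions, and a legitimate page that happens to contain one of the marker phrases would be misclassified.

```python
# Status codes commonly returned when a request is refused or throttled (assumed set).
BLOCK_STATUS_CODES = {403, 429, 503}

# Phrases that often appear on challenge pages (assumed, far from exhaustive).
BLOCK_MARKERS = ("captcha", "access denied", "verify you are human")


def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristically decide whether a response is a block or challenge page."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


print(looks_blocked(200, "<html><body>$19.99</body></html>"))      # False
print(looks_blocked(200, "<title>Verify you are human</title>"))   # True
print(looks_blocked(429, ""))                                      # True
```

A scraper could call this on each plain response and fall back to `stealth_extract` only for the URLs where it returns True.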
Start scraping without getting blocked at snapapi.pics — 200 free captures per month, no credit card required. The extract and scrape endpoints with stealth mode enabled are available on all paid plans.