Anti-bot technology has become sophisticated. Here's what the modern scraping stack looks like and how to build scrapers that survive Cloudflare, DataDome, and IP-based blocking.
Web scraping has never been harder — or more valuable. The amount of data locked behind JavaScript-rendered pages, paywalls, and dynamic content has grown dramatically. Simultaneously, anti-bot technology has advanced: Cloudflare's bot management, DataDome, PerimeterX, and similar systems can fingerprint browsers, analyze behavioral patterns, and block scrapers with high accuracy.
This doesn't mean scraping is dead. It means naive scrapers are dead. On sites behind these systems, plain Python requests calls with a spoofed User-Agent are blocked almost immediately. Even basic Selenium and Puppeteer scripts get flagged by fingerprinting systems that detect automation indicators. Effective scraping in 2026 requires understanding what detection systems look for and systematically defeating those signals.
Modern bot detection operates at multiple layers. At the TLS/network layer, detection systems fingerprint your TLS client hello, looking for patterns that differ from real browsers. Scrapers using Python requests or Go's default HTTP client have different TLS fingerprints than Chrome — detection systems can identify this before they even see your browser fingerprint.
At the browser layer, headless Chrome can be detected through dozens of signals: navigator.webdriver being true, missing browser plugins, inconsistent screen dimensions, the absence of real mouse movement history, WebRTC IP leaks, canvas fingerprint anomalies, and more. Playwright and Puppeteer in default configuration fail most fingerprint tests.
At the behavioral layer, request timing patterns, scroll events, and mouse movement distributions differ between humans and bots. Systems like DataDome collect this behavioral data and use ML models to classify sessions.
For most scraping use cases, a managed API like SnapAPI handles the fingerprinting, TLS spoofing, and residential IP rotation for you. You send a URL; you get back scraped content or a screenshot. The infrastructure handles all the anti-detection complexity. SnapAPI's scrape endpoint returns full page HTML including JavaScript-rendered content, with our stealth browser layer handling fingerprinting automatically.
import requests

result = requests.get("https://snapapi.pics/v1/scrape", params={
    "access_key": "your_key",
    "url": "https://target-site.com/products",
    "wait_for": ".product-listing",  # CSS selector to wait for
}).json()
html = result["html"]  # fully rendered page HTML
For complex scrapers requiring multi-step interactions, use Playwright with playwright-stealth. This patches the most commonly detected browser automation signals: navigator.webdriver, chrome.runtime, plugins array, and others. Combine with residential proxies for IP diversity.
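import asyncio

from playwright.async_api import async_playwright
# stealth_async is the playwright-stealth 1.x entry point; newer releases may expose a different API
from playwright_stealth import stealth_async

async def fetch_rendered(target_url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await stealth_async(page)  # patch detection signals
        await page.goto(target_url)
        return await page.content()

html = asyncio.run(fetch_rendered("https://target-site.com/products"))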
For sites that don't require JavaScript rendering, curl-cffi in Python impersonates real browser TLS fingerprints. This defeats TLS-level detection while being much faster and lighter than a full headless browser. It handles most Cloudflare challenges that don't require JavaScript execution.
from curl_cffi import requests as cfrequests

url = "https://target-site.com/products"
# Impersonate Chrome 120 TLS fingerprint
response = cfrequests.get(url, impersonate="chrome120")
html = response.text
Check robots.txt before scraping any site. Many sites explicitly allow scraping of their public data — respecting these rules keeps your scraper running longer and avoids legal issues. For scraping data you plan to use commercially, consult with a lawyer familiar with your jurisdiction. The legality of web scraping varies significantly by country and use case.
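The robots.txt check can be automated with Python's standard library. The following is a minimal sketch, not a complete compliance tool: the function name `is_allowed`, the user agent `my-scraper`, and the inlined `SAMPLE_ROBOTS` rules are all hypothetical, and the rules are parsed from a string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, inlined for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /checkout/
Allow: /products/
"""

for path in ("/products/widget", "/checkout/cart"):
    target = "https://target-site.com" + path
    print(path, is_allowed(SAMPLE_ROBOTS, "my-scraper", target))
```

In a real scraper you would fetch the site's live robots.txt once (for example with `RobotFileParser.set_url` followed by `read`) and reuse the parser for every URL on that host.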
Aggressive scraping gets you blocked and can harm target sites. Add delays between requests (1–5 seconds is generally safe), respect Retry-After headers, and never scrape at rates that could constitute a denial-of-service attack. Use exponential backoff when you receive 429 or 503 responses. Managed APIs like SnapAPI handle this automatically, throttling your requests appropriately.
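One way to combine exponential backoff with the Retry-After header is sketched below. It is an illustration under stated assumptions rather than a reference implementation: `backoff_delay` and `fetch_with_retries` are hypothetical names, the base, cap, and jitter values are arbitrary defaults, and `send` stands in for whatever HTTP call you use, returning a `(status, headers, body)` tuple.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

RETRYABLE = {429, 503}  # statuses the text above says to back off on


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0, jitter: float = 0.5) -> float:
    """Seconds to wait before the next try; attempt is 0-based."""
    # Honor a Retry-After value given as whole seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after)
    # Otherwise double the delay each attempt, up to cap, plus a little jitter.
    return min(cap, base * (2 ** attempt)) + random.uniform(0, jitter)


def fetch_with_retries(send: Callable[[], Tuple[int, Mapping[str, str], str]],
                       max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Plain dicts are case-sensitive; HTTP libraries usually are not.
            sleep(backoff_delay(attempt, headers.get("Retry-After")))
    raise RuntimeError("gave up after repeated 429/503 responses")
```

Passing `sleep` as a parameter keeps the retry loop easy to exercise in tests without actually waiting. The HTTP-date form of Retry-After is not parsed here; it falls through to the computed delay.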
IP-based blocking is one of the most common anti-scraping measures. Datacenter proxies (from providers like Bright Data's datacenter pool or Oxylabs) are fast and cheap but have known IP ranges that many sites block outright. Residential proxies route your traffic through real consumer ISP connections, making your requests look like they come from home users. They're more expensive (typically $5–15/GB) but work on sites that block datacenters.
For most scraping projects, start with datacenter proxies and switch to residential only when you hit blocks. Many sites never bother blocking datacenters for low-volume scraping. The cost difference is significant — residential proxies for high-volume scraping can cost hundreds of dollars per month. Use them only where necessary.
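The "datacenter first, residential only on a block" policy can be expressed as a small fallback loop. This is a sketch under assumptions: `fetch_with_escalation`, the tier names, and the proxy URLs are hypothetical, treating 403 and 429 as "blocked" is a simplification, and `fetch` is an injected callable so the logic stays independent of any HTTP library.

```python
from typing import Callable, Optional, Sequence, Tuple

BLOCK_STATUSES = {403, 429}  # assumption: statuses treated as an IP block


def fetch_with_escalation(
    url: str,
    tiers: Sequence[Tuple[str, Optional[str]]],
    fetch: Callable[[str, Optional[str]], Tuple[int, str]],
) -> Tuple[str, str]:
    """Try each proxy tier in order, cheapest first.

    Returns (tier_name, body) for the first tier that is not blocked.
    """
    for name, proxy in tiers:
        status, body = fetch(url, proxy)
        if status not in BLOCK_STATUSES:
            return name, body
    raise RuntimeError(f"all proxy tiers were blocked for {url}")


# Hypothetical proxy endpoints, listed cheapest first.
TIERS = [
    ("datacenter", "http://dc-proxy.example:8000"),
    ("residential", "http://resi-proxy.example:8000"),
]
```

With the requests library, `fetch` could wrap `requests.get(url, proxies={"http": proxy, "https": proxy})` and return the status code and response text. Logging which tier succeeded per site shows where residential bandwidth is actually being spent.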
Managed scraping APIs like SnapAPI handle proxy rotation automatically as part of the service. You don't manage IP pools, rotation logic, or proxy provider accounts — you just send the URL and get back content. For simple scraping use cases, this is almost always cheaper than managing your own proxy infrastructure.
CAPTCHAs are the final defense layer for many sites. reCAPTCHA v3 is behavioral: it scores user interactions rather than presenting a challenge. hCaptcha usually presents an image challenge, but it also weighs behavioral signals when deciding whether to show one. Passing these checks requires genuine-looking browser behavior: realistic mouse movements, natural timing between page loads, and a consistent browsing history. Automated CAPTCHA solving services like 2captcha and Anti-Captcha can solve image CAPTCHAs but don't help with behavioral scores.
The most reliable approach for scraping CAPTCHA-protected sites is using a managed browser service with built-in CAPTCHA handling, or simply avoiding those sites if possible. For public data that doesn't require authentication, many sites only show CAPTCHAs to bots — a well-configured stealth browser often passes them without any CAPTCHA solving service.
Once you have scraped HTML, you need to extract structured data from it. BeautifulSoup and lxml are the standard Python tools. For repeatable extraction from consistent page structures, define CSS selectors or XPath expressions as configuration — this makes your scraper maintainable when the target page structure changes slightly. Store raw HTML alongside extracted data so you can re-parse when your extraction logic improves without re-scraping.
from bs4 import BeautifulSoup

def extract_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    return [
        {
            "name": card.select_one(".product-title").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "url": card.select_one("a")["href"],
        }
        for card in soup.select(".product-card")
    ]
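The advice to keep raw HTML next to the extracted data can be sketched with the standard library alone. The layout below is an assumption made for illustration, not a prescribed format: `page_key`, `save_page`, and `load_raw_html` are hypothetical helpers that write one `.html` file and one `.json` file per URL, both named after a hash of the URL.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def page_key(url: str) -> str:
    """Stable, filesystem-safe identifier derived from the URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]


def save_page(out_dir: Path, url: str, html: str, records: list[dict]) -> str:
    """Write the raw HTML and the extracted records side by side."""
    out_dir.mkdir(parents=True, exist_ok=True)
    key = page_key(url)
    (out_dir / f"{key}.html").write_text(html, encoding="utf-8")
    payload = {"url": url, "records": records}
    (out_dir / f"{key}.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return key


def load_raw_html(out_dir: Path, url: str) -> str:
    """Read back the stored HTML so it can be re-parsed without re-scraping."""
    return (out_dir / f"{page_key(url)}.html").read_text(encoding="utf-8")


# Usage: a temporary directory stands in for real storage.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "pages"
    save_page(store, "https://target-site.com/products",
              "<html><body>sample</body></html>", [{"name": "Widget"}])
    raw = load_raw_html(store, "https://target-site.com/products")
```

When the extraction logic changes, you can loop over the stored `.html` files, run the new parser, and overwrite only the `.json` side.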
SnapAPI's scrape endpoint is the right choice when you need JavaScript-rendered page content and don't want to manage a browser. Send the URL, get back fully rendered HTML including content loaded by JavaScript after page load. Use it for product pages, search results, social content, and any page where the data you need is injected by client-side JavaScript. For static HTML pages, a simple requests call is faster — but for the modern web where most content is JavaScript-rendered, SnapAPI's managed browser removes all the complexity. Try it free — 200 requests per month, no credit card.
Effective scraping requires not just technical skill but also ethical judgment. Before scraping any site, review their Terms of Service. Many sites explicitly prohibit scraping, and violating ToS can create legal exposure even when data is publicly accessible. The Computer Fraud and Abuse Act in the US, and similar laws in other jurisdictions, may apply depending on how scraping is conducted.
Always identify yourself with a meaningful User-Agent string that includes a contact URL or email address. This allows site operators to reach you if your scraper causes problems, rather than escalating immediately to IP bans or legal action. Avoid scraping at rates that affect site performance — one request every two to five seconds is generally considered polite for public data.
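A self-identifying User-Agent and a fixed minimum gap between requests might look like the sketch below. Everything named here is hypothetical: `build_user_agent`, `PoliteThrottle`, the bot name, and the contact details are placeholders, and the two-second default is simply the lower end of the range mentioned above.

```python
import time
from typing import Callable, Optional


def build_user_agent(bot_name: str, version: str, contact_url: str, contact_email: str) -> str:
    """Format a User-Agent string that tells site operators how to reach you."""
    return f"{bot_name}/{version} (+{contact_url}; {contact_email})"


class PoliteThrottle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float = 2.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Placeholder identity; substitute your own project and contact details.
HEADERS = {
    "User-Agent": build_user_agent(
        "example-research-bot", "1.0", "https://example.com/bot", "ops@example.com"
    )
}
```

Call `throttle.wait()` immediately before each request and pass `HEADERS` to your HTTP client. The clock and sleep functions are injectable so the pacing can be tested without real delays.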
For large-scale data collection, check whether the site offers an official API or data export. Official channels provide cleaner, more reliable data and eliminate legal uncertainty. Many sites that appear to require scraping actually have undocumented APIs that their own frontend uses — inspect network requests before building a scraper.
SnapAPI's scrape endpoint is ideal when you need JavaScript-rendered content without managing browser infrastructure. The stealth browser layer handles fingerprinting automatically. You get back the fully rendered HTML of the page after all JavaScript has executed and network requests have completed. This covers the vast majority of modern web pages where content is injected by client-side JavaScript. For static HTML pages, a simple HTTP client is faster and cheaper. For pages requiring complex multi-step interaction — form submission, clicking through pagination, handling modal dialogs — you may need a full browser automation solution like Playwright. But for read-only data extraction from JavaScript-rendered pages, SnapAPI's scrape endpoint is the fastest path from URL to structured data. Try it free — 200 requests per month, no credit card required.