Web Scraping Without Getting Blocked (2026 Guide)
Getting blocked while scraping is frustrating — especially when your scraper worked yesterday. Modern bot detection has layered defenses: rate limiting, IP reputation databases, JavaScript fingerprinting, behavioural analysis, and CAPTCHAs. This guide breaks down exactly how each detection method works and gives you the countermeasures.
How Websites Detect Bots
| Detection method | What it checks | Countermeasure |
|---|---|---|
| Rate limiting | Request frequency per IP | Delays + proxy rotation |
| IP reputation | Datacenter / proxy IP ranges | Residential proxies |
| User-Agent | Headless Chrome UA string | Realistic UA rotation |
| navigator.webdriver | Automation flag in JS | CDP patch / stealth plugin |
| Browser fingerprint | Canvas, WebGL, fonts, plugins | Stealth evasion / real browsers |
| Behavioural analysis | Mouse movement, scroll, timing | Human-like interaction simulation |
| Honeypot links | Hidden links only bots follow | Check visibility before clicking |
| CAPTCHA | Human verification challenge | 2captcha / avoid triggering |
Rate Limiting and Polite Delays
The single most common reason scrapers get blocked is too many requests too fast. A human browsing a site averages 3–10 seconds between page loads. Your scraper should be similar.
// Configurable delay between requests
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Jitter: random delay between min and max ms
function jitter(min = 1000, max = 4000) {
  return sleep(Math.floor(Math.random() * (max - min) + min));
}

async function scrapePagesPolitely(urls) {
  const results = [];
  for (const url of urls) {
    try {
      const data = await fetchPage(url);
      results.push(data);
    } catch (err) {
      console.error(`Failed: ${url}`, err.message);
    }
    await jitter(1500, 5000); // wait 1.5–5s between requests
  }
  return results;
}
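// p-queue for controlled concurrency: at most 2 requests in flight, at most 2 started per second.
// This is the CommonJS form for p-queue v6; v7+ is ESM-only (import PQueue from 'p-queue').
const { default: PQueue } = require('p-queue');
const queue = new PQueue({ concurrency: 2, interval: 1000, intervalCap: 2 });

async function scrapeWithQueue(urls) {
  // queue.add() returns a promise that resolves with the task's result
  return Promise.all(urls.map(url => queue.add(() => fetchPage(url))));
}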
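Fixed delays do not help once a server has already started throttling. The sketch below shows one way to pick a wait time after a 429 or 503 response: honour the Retry-After header when it is present, otherwise fall back to exponential backoff with random jitter. The function name `retryDelayMs` and the defaults `baseMs` and `maxMs` are assumptions made for this example, not part of any library.

```javascript
// Sketch: decide how long to wait before retrying a throttled request.
// retryDelayMs, baseMs and maxMs are illustrative names, not a library API.
const RETRYABLE_STATUS = new Set([429, 503]);

function retryDelayMs(status, retryAfter, attempt, baseMs = 1000, maxMs = 60000) {
  // Anything other than 429/503 is not treated as throttling here.
  if (!RETRYABLE_STATUS.has(status)) return null;

  // Retry-After is either a number of seconds or an HTTP date.
  if (retryAfter != null && retryAfter !== '') {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds)) {
      return Math.min(Math.max(seconds, 0) * 1000, maxMs);
    }
    const dateMs = Date.parse(retryAfter);
    if (!Number.isNaN(dateMs)) {
      return Math.min(Math.max(dateMs - Date.now(), 0), maxMs);
    }
  }

  // No usable header: exponential backoff (baseMs * 2^attempt, capped at maxMs)
  // with "full jitter", i.e. a random wait between 0 and that ceiling.
  const ceiling = Math.min(baseMs * 2 ** attempt, maxMs);
  return Math.floor(Math.random() * ceiling);
}
```

With axios, the two inputs would typically come from `err.response.status` and `err.response.headers['retry-after']` inside a catch block; a `null` result means the error is not a throttling response and should be handled some other way.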
User-Agent Rotation
Default axios/requests user-agents scream "bot". Always set a realistic Chrome or Firefox UA — and rotate it to avoid pattern detection.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15',
];

function randomUA() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

const axios = require('axios');

async function fetchPage(url) {
  return axios.get(url, {
    headers: {
      'User-Agent': randomUA(),
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9',
      'Accept-Encoding': 'gzip, deflate, br',
      'Cache-Control': 'no-cache',
      'Pragma': 'no-cache',
      'DNT': '1',
      'Connection': 'keep-alive',
      'Upgrade-Insecure-Requests': '1',
    },
    timeout: 15000
  });
}
Set the Accept-Language and Accept headers to realistic browser values as well. Detection systems look at the full header fingerprint, not just User-Agent.
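Because the whole header set is fingerprinted, rotating the User-Agent on its own can produce combinations no real browser sends, such as a Firefox UA alongside Chrome's client-hint headers. One way to avoid that is to rotate complete header profiles instead. The sketch below assumes two hand-written profiles; the names `HEADER_PROFILES` and `randomHeaderProfile` are hypothetical, and the header values are illustrative and should be copied from a real browser's network inspector before use.

```javascript
// Sketch: rotate whole header profiles so the User-Agent always travels
// with headers that match its browser family. Values are illustrative.
const HEADER_PROFILES = [
  {
    name: 'chrome-windows',
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.9',
      // Chromium-based browsers send client hints; Firefox does not.
      'sec-ch-ua': '"Google Chrome";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
      'sec-ch-ua-mobile': '?0',
      'sec-ch-ua-platform': '"Windows"',
    },
  },
  {
    name: 'firefox-windows',
    headers: {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
      'Accept-Language': 'en-US,en;q=0.5',
    },
  },
];

// rng is injectable so the choice can be made deterministic in tests.
function randomHeaderProfile(rng = Math.random) {
  const profile = HEADER_PROFILES[Math.floor(rng() * HEADER_PROFILES.length)];
  return { ...profile.headers }; // copy, so callers cannot mutate the template
}
```

The returned object can be passed directly as the `headers` option of an axios request. Keeping one profile for the lifetime of a session, rather than picking a new one per request, is usually closer to how a real visitor looks.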
Proxy Rotation
Datacenter proxies (cheap) are easily detected by IP reputation databases. For protected sites, residential or mobile proxies are more reliable — they route traffic through real consumer IPs.
const axios = require('axios');
const { HttpsProxyAgent } = require('https-proxy-agent');

const PROXIES = [
  'http://user:pass@proxy1.provider.com:8080',
  'http://user:pass@proxy2.provider.com:8080',
  'http://user:pass@proxy3.provider.com:8080',
];

function getRandomProxy() {
  return PROXIES[Math.floor(Math.random() * PROXIES.length)];
}

// Uses sleep() and randomUA() from the earlier sections.
async function fetchWithProxy(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const proxyUrl = getRandomProxy(); // new proxy on every attempt
    try {
      const agent = new HttpsProxyAgent(proxyUrl);
      const { data } = await axios.get(url, {
        httpsAgent: agent, // applies to https:// targets
        proxy: false,      // let the agent handle proxying, not axios itself
        headers: { 'User-Agent': randomUA() },
        timeout: 20000
      });
      return data;
    } catch (err) {
      if (i === retries - 1) throw err;
      await sleep(2000 * 2 ** i); // exponential backoff: 2s, 4s, 8s, ...
    }
  }
}
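Picking a proxy at random means a dead or banned proxy keeps being chosen. A small pool that rotates round-robin and temporarily benches proxies after repeated failures is one way around that. The sketch below is a minimal in-memory version; the class name `ProxyPool` and the options `maxFailures` and `cooldownMs` are hypothetical names chosen for this example.

```javascript
// Sketch of a proxy pool that benches proxies after repeated failures.
// ProxyPool, maxFailures and cooldownMs are illustrative names.
class ProxyPool {
  constructor(proxyUrls, { maxFailures = 3, cooldownMs = 5 * 60 * 1000 } = {}) {
    this.entries = proxyUrls.map(url => ({ url, failures: 0, benchedUntil: 0 }));
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.cursor = 0;
  }

  // Round-robin over proxies that are not currently benched.
  next(now = Date.now()) {
    for (let i = 0; i < this.entries.length; i++) {
      const entry = this.entries[(this.cursor + i) % this.entries.length];
      if (entry.benchedUntil <= now) {
        this.cursor = (this.cursor + i + 1) % this.entries.length;
        return entry.url;
      }
    }
    return null; // every proxy is cooling down
  }

  // A successful request clears the proxy's failure streak.
  reportSuccess(url) {
    const entry = this.entries.find(e => e.url === url);
    if (entry) entry.failures = 0;
  }

  // After maxFailures consecutive failures the proxy is benched for cooldownMs.
  reportFailure(url, now = Date.now()) {
    const entry = this.entries.find(e => e.url === url);
    if (!entry) return;
    entry.failures += 1;
    if (entry.failures >= this.maxFailures) {
      entry.benchedUntil = now + this.cooldownMs;
      entry.failures = 0;
    }
  }
}
```

In a retry loop, `pool.next()` would replace the random pick, with `reportSuccess` or `reportFailure` called once the request settles. A `null` result is a signal to pause the crawl rather than hammer proxies that are already failing.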
Playwright Stealth Mode
A headless Playwright browser has dozens of tell-tale fingerprints. The most important one to fix is navigator.webdriver, which is set to true by default in automation contexts. Here are the essential patches, applied with a launch flag and an init script that runs before any page script:
const { chromium } = require('playwright');

async function stealthScrape(url) {
  const browser = await chromium.launch({
    headless: true,
    args: [
      '--disable-blink-features=AutomationControlled',
      '--disable-dev-shm-usage',
      '--no-sandbox',
      '--disable-setuid-sandbox',
    ]
  });
  try {
    const context = await browser.newContext({
      userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
      viewport: { width: 1366, height: 768 },
      locale: 'en-US',
      timezoneId: 'America/New_York',
      permissions: ['geolocation'],
      extraHTTPHeaders: { 'Accept-Language': 'en-US,en;q=0.9' }
    });
    const page = await context.newPage();

    // Patch automation fingerprints before any page script runs
    await page.addInitScript(() => {
      Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
      Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3, 4, 5] });
      Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
      window.chrome = { runtime: {} };
    });

    // Block fingerprinting scripts
    await page.route('**/*', route => {
      const url = route.request().url();
      if (/fingerprintjs|botd|datadome|perimeterx/.test(url)) return route.abort();
      return route.continue();
    });

    await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });

    // Human-like: random scroll before extracting
    await page.evaluate(() => window.scrollBy(0, Math.floor(Math.random() * 400 + 200)));
    await page.waitForTimeout(Math.floor(Math.random() * 1000 + 500));

    return await page.content();
  } finally {
    await browser.close(); // always release the browser, even if navigation fails
  }
}
If you use Puppeteer instead, the stealth plugin applies similar patches for you: npm install puppeteer-extra puppeteer-extra-plugin-stealth. For full details, see our Puppeteer stealth guide.
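Behavioural analysis also looks at how the pointer moves, and a cursor that jumps straight to its target in one step is an easy signal. One common approach is to move along a slightly curved path in many small steps. The sketch below generates such a path as plain coordinates using a quadratic Bezier curve; the function name `mousePath` and its parameters are hypothetical, and the curve shape and noise amounts are arbitrary choices rather than tuned values.

```javascript
// Sketch: generate a curved, slightly noisy path between two points.
// mousePath and its parameters are illustrative, not a Playwright API.
function mousePath(from, to, steps = 20, rng = Math.random) {
  // A random control point bends the path away from a straight line.
  const control = {
    x: (from.x + to.x) / 2 + (rng() - 0.5) * 200,
    y: (from.y + to.y) / 2 + (rng() - 0.5) * 200,
  };
  const points = [];
  for (let i = 1; i <= steps; i++) {
    const t = i / steps;
    const u = 1 - t;
    // Quadratic Bezier: B(t) = u^2 * from + 2ut * control + t^2 * to
    const x = u * u * from.x + 2 * u * t * control.x + t * t * to.x;
    const y = u * u * from.y + 2 * u * t * control.y + t * t * to.y;
    // Intermediate points get up to 1px of noise; the last point is exact.
    const noise = i === steps ? 0 : (rng() - 0.5) * 2;
    points.push({ x: Math.round(x + noise), y: Math.round(y + noise) });
  }
  return points;
}

// Usage with Playwright (page.mouse.move(x, y) is part of its Mouse API):
//   for (const p of mousePath({ x: 100, y: 100 }, { x: 640, y: 360 })) {
//     await page.mouse.move(p.x, p.y);
//     await page.waitForTimeout(5 + Math.random() * 15);
//   }
```

Generating the coordinates separately from the browser calls keeps the path logic easy to test and reuse, for example with Puppeteer's equivalent mouse API.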
SnapAPI Stealth Mode (Managed Solution)
Managing proxies, rotating fingerprints, and patching CDP is a full-time job. SnapAPI's stealth: true parameter handles all of it — residential IP rotation, fingerprint randomisation, and human-like behaviour — in a single API call.
const axios = require('axios');

async function stealthScrapeAPI(url) {
  const { data } = await axios.post('https://api.snapapi.pics/v1/scrape', {
    url,
    stealth: true, // residential proxy + fingerprint randomisation
    blockAds: true,
    blockCookieBanners: true,
    waitFor: 'networkidle'
  }, { headers: { 'X-Api-Key': process.env.SNAPAPI_KEY } });
  return data.html; // fully rendered HTML, ready for Cheerio/node-html-parser
}

// Same for screenshots
async function stealthScreenshot(url) {
  const { data } = await axios.post('https://api.snapapi.pics/v1/screenshot', {
    url,
    stealth: true,
    fullPage: true,
    format: 'webp'
  }, { headers: { 'X-Api-Key': process.env.SNAPAPI_KEY } });
  return Buffer.from(data.screenshot, 'base64');
}
The same stealth request can be made from Python by calling the HTTP endpoint directly, here with httpx:
import os

import httpx

async def stealth_scrape(url: str) -> str:
    async with httpx.AsyncClient() as client:
        r = await client.post(
            'https://api.snapapi.pics/v1/scrape',
            json={'url': url, 'stealth': True, 'blockAds': True, 'waitFor': 'networkidle'},
            headers={'X-Api-Key': os.environ['SNAPAPI_KEY']},
            timeout=60
        )
        r.raise_for_status()
        return r.json()['html']
Avoiding Honeypot Traps
Honeypots are invisible links or form fields that only automated tools interact with. Following a honeypot link flags your IP immediately.
const cheerio = require('cheerio');

function getVisibleLinks(html, baseUrl) {
  const $ = cheerio.load(html);
  return $('a[href]').map((_, el) => {
    const $el = $(el);
    // Strip whitespace so "display: none" and "display:none" both match.
    // Only inline styles and class names are checked, not external stylesheets.
    const style = ($el.attr('style') ?? '').replace(/\s+/g, '').toLowerCase();
    const cls = $el.attr('class') ?? '';
    // Skip hidden links (honeypots)
    const isHidden = (
      style.includes('display:none') ||
      style.includes('visibility:hidden') ||
      // Match a zero value only, so opacity:0.5 or font-size:0.9em stay visible
      /(^|;)(opacity|font-size):0(\.0+)?(px|em|rem|%)?(!important)?(;|$)/.test(style) ||
      cls.includes('hidden') ||
      cls.includes('invisible')
    );
    if (isHidden) return null;
    const href = $el.attr('href');
    if (!href || href.startsWith('javascript:') || href === '#') return null;
    try { return new URL(href, baseUrl).href; } catch { return null; }
  }).get().filter(Boolean);
}
In Playwright, check element.isVisible() before clicking.
Anti-Block Checklist
- ✓ Set realistic User-Agent and browser-like headers
- ✓ Add random delays (1–5s) between requests
- ✓ Limit concurrency to 2–3 parallel requests per domain
- ✓ Use residential proxies for well-protected sites
- ✓ Patch navigator.webdriver and other automation signals
- ✓ Simulate human behaviour: random scroll, mouse movement, viewport resize
- ✓ Skip hidden elements and honeypot links
- ✓ Respect robots.txt and Crawl-delay directives
- ✓ Retry with exponential backoff on 429 and 503 responses
- ✓ Monitor your error rate — rising 403s mean you're being detected
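The last checklist item is easier to act on when the error rate is measured rather than eyeballed in logs. The sketch below keeps a rolling window of recent response statuses and reports when the share of block-like responses crosses a threshold. The class name `BlockRateMonitor`, the `windowSize` and `threshold` options, and the choice to count 403, 429 and 503 as blocks are all assumptions made for this example.

```javascript
// Sketch: rolling window over recent response statuses.
// BlockRateMonitor, windowSize and threshold are illustrative names.
const BLOCK_STATUSES = new Set([403, 429, 503]);

class BlockRateMonitor {
  constructor({ windowSize = 50, threshold = 0.2 } = {}) {
    this.windowSize = windowSize;
    this.threshold = threshold;
    this.recent = []; // one boolean per response: true = looked like a block
  }

  // Record the HTTP status of a finished request.
  record(status) {
    this.recent.push(BLOCK_STATUSES.has(status));
    if (this.recent.length > this.windowSize) this.recent.shift();
  }

  // Fraction of responses in the window that looked like blocks.
  blockRate() {
    if (this.recent.length === 0) return 0;
    const blocked = this.recent.filter(Boolean).length;
    return blocked / this.recent.length;
  }

  // True once the block rate exceeds the configured threshold.
  shouldSlowDown() {
    return this.blockRate() > this.threshold;
  }
}
```

A crawler could call `record()` after every response and, when `shouldSlowDown()` returns true, lengthen its delays or switch proxies before the ban becomes permanent.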
Skip the cat-and-mouse game
SnapAPI handles proxies, stealth mode, and fingerprint rotation for you. One stealth: true parameter — that's it. 200 free requests/month.