JavaScript Web Scraping: From Basic Fetch to Full Browser Automation
A complete guide to scraping the web with Node.js — static pages with fetch and Cheerio, JavaScript-rendered sites with Playwright, and delegating browser rendering to an API.
When to Use fetch and Cheerio vs a Full Browser
Not every web scraping task requires a headless browser. Static HTML pages — those where the content is fully present in the initial server response — can be scraped with a plain HTTP request and an HTML parser. This approach is faster, cheaper, and easier to scale than launching Chromium for every request. If the data you need appears when you view page source in your browser, you do not need a browser to scrape it.
Cheerio is one of the most widely used Node.js libraries for parsing HTML. It implements a jQuery-like selector API that makes CSS-based extraction familiar to anyone with front-end experience. Here is a minimal example that extracts article headlines from a static news page, using the fetch function built into Node.js 18 and later:
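const cheerio = require('cheerio');

async function scrapeHeadlines(url) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; Scraper/1.0)' }
  });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const html = await response.text();
  const $ = cheerio.load(html);
  const headlines = [];
  // Adjust the selector to match the target site's markup
  $('h2 a, h3 a, .article-title a').each((_, el) => {
    const href = $(el).attr('href');
    headlines.push({
      text: $(el).text().trim(),
      // Resolve relative links against the page URL
      href: href ? new URL(href, url).href : null
    });
  });
  return headlines;
}

scrapeHeadlines('https://news.ycombinator.com')
  .then(h => console.log(h.slice(0, 5)))
  .catch(err => console.error(err.message));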
This pattern works well for news aggregators, product catalogues backed by server-side rendering, Wikipedia pages, and most documentation sites. The limitation appears when the content you want is rendered by JavaScript after the initial HTML payload arrives — React, Vue, and Angular single-page applications fall into this category, as do many modern e-commerce product detail pages that load pricing and inventory via client-side API calls.
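The "view page source" test can also be run from code: fetch the raw HTML without executing any JavaScript and check whether a value you can see in the browser is already there. The following is a rough sketch of that idea, not a definitive detector. The function names appearsInStaticHtml and needsBrowser are illustrative, and the check is a plain substring match, so text that is split across tags or HTML-entity-encoded in the markup will be reported as missing.

```javascript
// Returns true if `needle` (text you can see in the rendered page)
// is already present in the raw server response.
function appearsInStaticHtml(html, needle) {
  // Collapse whitespace and ignore case so line breaks in the markup
  // do not cause false negatives.
  const normalize = s => s.replace(/\s+/g, ' ').trim().toLowerCase();
  return normalize(html).includes(normalize(needle));
}

// Fetches the page without running its JavaScript and reports whether
// a browser is probably needed to see `needle`.
// `fetchImpl` is injectable for testing; it defaults to the global fetch (Node.js 18+).
async function needsBrowser(url, needle, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const html = await response.text();
  return !appearsInStaticHtml(html, needle);
}
```

If needsBrowser resolves to false for a value such as a product price, fetch and Cheerio are likely sufficient for that page. If it resolves to true, the value is probably injected client-side.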
Scraping JavaScript-Rendered Pages with Playwright
Playwright is Microsoft's browser automation library, used here through its Node.js package. It launches and controls Chromium, Firefox, or WebKit (headless by default), executes all JavaScript on the page, and then lets you query the fully rendered DOM. For sites that require client-side rendering, Playwright provides precise control over the browser lifecycle:
const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    // Wait for dynamic content to load
    await page.waitForSelector('.product-price');
    const prices = await page.$$eval('.product-price', els =>
      els.map(el => el.textContent.trim())
    );
    return prices;
  } finally {
    // Close the browser even if navigation or extraction throws
    await browser.close();
  }
}
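The waitUntil: 'networkidle' option treats navigation as finished once there have been no network connections for at least 500 milliseconds, which usually means the initial round of client-side data fetching is complete. Pages that poll or hold connections open may never reach that state, and the Playwright documentation discourages relying on it. The waitForSelector call is therefore the more dependable check: it ensures the specific element you need is actually present before querying.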
The operational downside of running Playwright in-process is resource consumption. Each browser instance typically requires 200 to 400 megabytes of RAM and significant CPU time to render and execute JavaScript. Running more than a few concurrent instances on a standard server becomes expensive quickly, and managing browser process lifecycles — handling crashes, memory leaks, and zombie processes — adds real operational complexity to production deployments.
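One common way to reduce that overhead is to launch a single browser and open a short-lived page per URL, closing each page in a finally block so a failed navigation does not leak it. The sketch below assumes that approach. The function name scrapeSequentially and the extract callback are illustrative, and the browser argument is expected to be a Playwright Browser such as the one returned by chromium.launch().

```javascript
// Reuses one browser for many URLs, opening and closing one page per URL.
// `browser`: a Playwright Browser (for example from chromium.launch()).
// `extract`: caller-supplied async function that receives the loaded page.
async function scrapeSequentially(browser, urls, extract) {
  const results = [];
  for (const url of urls) {
    const page = await browser.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
      results.push({ url, data: await extract(page) });
    } catch (err) {
      // Record the failure and continue with the next URL.
      results.push({ url, error: err.message });
    } finally {
      // Always release the page, whether or not the scrape succeeded.
      await page.close();
    }
  }
  return results;
}

// Example usage (requires the playwright package):
//   const { chromium } = require('playwright');
//   const browser = await chromium.launch();
//   try {
//     const out = await scrapeSequentially(browser, urls, page =>
//       page.$$eval('.product-price', els => els.map(el => el.textContent.trim()))
//     );
//   } finally {
//     await browser.close();
//   }
```

Launching the browser once amortizes startup cost across the batch. It does not remove the per-page memory cost, so long-running jobs may still want to restart the browser periodically.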
Delegating Browser Rendering to a Scraping API
For teams that need JavaScript rendering without managing browser infrastructure, a scraping API handles the heavy lifting remotely. You send an HTTP request with the target URL and your requirements, and the API executes the page in a fully managed Chromium instance on its infrastructure, then returns rendered HTML, structured JSON, or a screenshot depending on which endpoint you call.
SnapAPI's scrape endpoint returns the fully rendered HTML of any page after JavaScript execution completes. The extract endpoint goes further: you pass a CSS selector and receive the matched text or attribute values directly, skipping the HTML parsing step entirely:
const response = await fetch('https://api.snapapi.pics/v1/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Api-Key': process.env.SNAPAPI_KEY
  },
  body: JSON.stringify({
    url: 'https://shop.example.com/product/123',
    selector: '.product-price',
    wait_for: '.product-price'
  })
});
const { text } = await response.json();
console.log('Price:', text); // "$49.99"
This replaces the entire Playwright setup — browser installation, launch time, page lifecycle management, and cleanup — with a single fetch call that works identically in Node.js, Deno, Bun, or any environment that supports the Fetch API. The remote API handles proxy rotation, anti-bot bypass, JavaScript execution, and retry logic transparently.
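In production code it is usually worth wrapping the call so that HTTP errors and hung requests surface explicitly. The following is a minimal sketch under stated assumptions: the endpoint, header, and body fields are taken from the example above, while the wrapper name extractText, the 30-second default timeout, and the treatment of any non-2xx status as a thrown error are illustrative choices. The API's actual error response format is not described here.

```javascript
// Wraps the extract call with a status check and a client-side timeout.
// `fetchImpl` is injectable for testing; it defaults to the global fetch (Node.js 18+).
async function extractText(url, selector, options = {}) {
  const {
    apiKey = process.env.SNAPAPI_KEY,
    timeoutMs = 30000,
    fetchImpl = fetch
  } = options;

  const response = await fetchImpl('https://api.snapapi.pics/v1/extract', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Api-Key': apiKey
    },
    // Same body fields as the example above; wait_for reuses the selector.
    body: JSON.stringify({ url, selector, wait_for: selector }),
    // Abort the request if no response arrives within timeoutMs.
    signal: AbortSignal.timeout(timeoutMs)
  });

  if (!response.ok) {
    // Assumption: any non-2xx status is treated as a failure.
    throw new Error(`Extract request failed: HTTP ${response.status}`);
  }
  const { text } = await response.json();
  return text;
}
```

A call such as extractText('https://shop.example.com/product/123', '.product-price') then either resolves to the extracted text or rejects with an error the caller can log or retry.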
Handling Anti-Bot Measures in JavaScript Scrapers
Many commercial websites deploy anti-bot systems including Cloudflare Turnstile, DataDome, PerimeterX, and custom fingerprinting scripts. These systems detect headless browsers by examining browser feature flags, TLS fingerprints, JavaScript execution timing, and behavioral signals that differ between real user sessions and automated scripts.
When using Playwright directly, you can enable stealth mode with playwright-extra and puppeteer-extra-plugin-stealth. These plugins patch common detection vectors including the navigator.webdriver property, Chrome runtime checks, and WebGL vendor strings. However, stealth plugins require frequent updates as detection vendors continuously evolve their checks, making maintenance an ongoing responsibility.
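To make the idea of a detection vector concrete, the sketch below shows the kind of simple checks a page script could run against the navigator object. It is purely illustrative and does not reproduce any vendor's logic: the function name automationSignals and the choice of three signals are assumptions, and real systems combine many more signals with server-side analysis.

```javascript
// Returns a list of automation hints found on a navigator-like object.
// In a browser you would pass window.navigator; a plain object works for testing.
function automationSignals(nav) {
  const signals = [];

  // The WebDriver specification has automated browsers expose
  // navigator.webdriver === true.
  if (nav.webdriver === true) {
    signals.push('navigator.webdriver is true');
  }

  // Headless Chrome has historically identified itself in the user agent string.
  if (/HeadlessChrome/.test(nav.userAgent || '')) {
    signals.push('HeadlessChrome in user agent');
  }

  // An empty language list has been a tell in some older headless builds.
  if (!nav.languages || nav.languages.length === 0) {
    signals.push('empty navigator.languages');
  }

  return signals;
}
```

Stealth plugins work by overriding properties like these before any page script runs, which is why each new check on the detection side tends to require a matching patch.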
SnapAPI's stealth mode is maintained at the infrastructure level — updates are invisible to callers. Pass "stealth": true in your extract or scrape request body and the API selects a browser profile optimized to bypass detection on the target site without any plugin management on your end.
Batch Scraping with Concurrency Control in Node.js
When processing lists of URLs, concurrency dramatically reduces total execution time. Node.js handles many concurrent HTTP requests efficiently through its event loop. The key is capping parallelism to avoid overwhelming the target server or exceeding API rate limits:
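async function scrapeAll(urls, concurrency = 5) {
  // Results are stored by index so the output order matches the input order
  const results = new Array(urls.length);
  const queue = urls.map((url, index) => ({ url, index }));

  async function worker() {
    while (queue.length > 0) {
      const { url, index } = queue.shift();
      try {
        const res = await fetch('https://api.snapapi.pics/v1/extract', {
          method: 'POST',
          headers: {
            'X-Api-Key': process.env.SNAPAPI_KEY,
            'Content-Type': 'application/json'
          },
          body: JSON.stringify({ url, selector: 'h1' })
        });
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        const data = await res.json();
        results[index] = { url, title: data.text };
      } catch (err) {
        // Record the failure so one bad URL does not abort the whole batch
        results[index] = { url, error: err.message };
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}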
This worker pool pattern processes the URL queue as fast as the API responds, keeping up to concurrency requests in flight until the queue drains. Tune the concurrency value based on your plan's rate limits: the SnapAPI Starter plan supports up to 5 concurrent requests, Pro supports 20, and Business supports unlimited concurrency.
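Even with a capped worker pool, a batch can occasionally run into a rate limit. A common defensive pattern is to retry on HTTP 429 with exponential backoff, honoring a Retry-After header when one is present. The sketch below assumes the API signals rate limiting with status 429, which is the standard HTTP convention but is not confirmed for SnapAPI here. The helper name fetchWithBackoff and its default values are illustrative.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Retries a request while the server answers 429 (Too Many Requests).
// `fetchImpl` is injectable for testing; it defaults to the global fetch (Node.js 18+).
async function fetchWithBackoff(url, options, config = {}) {
  const { retries = 3, baseDelayMs = 500, fetchImpl = fetch } = config;

  for (let attempt = 0; ; attempt++) {
    const response = await fetchImpl(url, options);

    // Return anything that is not a rate-limit response, or give up
    // and return the last 429 once the retry budget is spent.
    if (response.status !== 429 || attempt >= retries) {
      return response;
    }

    // Prefer the server's Retry-After hint when it is a number of seconds;
    // otherwise double the delay on each attempt.
    const retryAfter = Number(response.headers.get('retry-after'));
    const delayMs = retryAfter > 0 ? retryAfter * 1000 : baseDelayMs * 2 ** attempt;
    await sleep(delayMs);
  }
}
```

Inside a worker, fetchWithBackoff can take the place of the direct fetch call with the same URL and options. Lowering the concurrency value remains the first thing to try if 429 responses are frequent.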
Get started with SnapAPI at snapapi.pics — 200 free extractions per month, no credit card required. The API handles browser rendering, JavaScript execution, stealth mode, and proxy rotation so your Node.js scraper stays lean and focused on your business logic rather than browser infrastructure.