Guides | April 4, 2026

JavaScript Web Scraping: From Basic Fetch to Full Browser Automation

A complete guide to scraping the web with Node.js — static pages with fetch and Cheerio, JavaScript-rendered sites with Playwright, and delegating browser rendering to an API.

When to Use fetch and Cheerio vs a Full Browser

Not every web scraping task requires a headless browser. Static HTML pages — those where the content is fully present in the initial server response — can be scraped with a plain HTTP request and an HTML parser. This approach is faster, cheaper, and easier to scale than launching Chromium for every request. If the data you need appears when you view page source in your browser, you do not need a browser to scrape it.

Cheerio is the most popular Node.js library for parsing HTML. It implements a jQuery-like selector API that makes CSS-based extraction familiar to anyone with front-end experience. Here is a minimal example that extracts all article headlines from a static news page:

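const cheerio = require('cheerio');

// Requires Node.js 18+ for the built-in fetch
async function scrapeHeadlines(url) {
  const response = await fetch(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; Scraper/1.0)' }
  });
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  const html = await response.text();
  const $ = cheerio.load(html);
  const headlines = [];
  // Generic headline selectors; '.titleline > a' is assumed to match
  // Hacker News story links, which do not use h2/h3 headings
  $('h2 a, h3 a, .article-title a, .titleline > a').each((_, el) => {
    headlines.push({ text: $(el).text().trim(), href: $(el).attr('href') });
  });
  return headlines;
}

scrapeHeadlines('https://news.ycombinator.com')
  .then(h => console.log(h.slice(0, 5)))
  .catch(err => console.error('Scrape failed:', err.message));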

This pattern works well for news aggregators, product catalogues backed by server-side rendering, Wikipedia pages, and most documentation sites. The limitation appears when the content you want is rendered by JavaScript after the initial HTML payload arrives — React, Vue, and Angular single-page applications fall into this category, as do many modern e-commerce product detail pages that load pricing and inventory via client-side API calls.
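The href values collected by the headline example come back exactly as they appear in the markup, so they are often relative paths. A minimal sketch of resolving them against the page URL, assuming the built-in WHATWG URL class; resolveLinks is a hypothetical helper name, not part of Cheerio:

```javascript
// Hypothetical helper: turn scraped hrefs into absolute URLs.
function resolveLinks(pageUrl, items) {
  const resolved = [];
  for (const item of items) {
    if (!item.href) continue; // anchors without an href are skipped
    try {
      // new URL(relative, base) applies standard URL resolution rules
      resolved.push({ ...item, href: new URL(item.href, pageUrl).href });
    } catch {
      // new URL() throws on malformed input; such links are dropped
    }
  }
  return resolved;
}
```

Passing the output of scrapeHeadlines through this helper, with the same URL that was fetched as pageUrl, should yield links that can be queued for further requests.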

Scraping JavaScript-Rendered Pages with Playwright

Playwright is Microsoft's headless browser automation library for Node.js. It launches and controls Chromium, Firefox, or WebKit, executes all JavaScript on the page, and then lets you query the fully rendered DOM. For sites that require client-side rendering, Playwright provides precise control over the browser lifecycle:

const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });

    // Wait for dynamic content to load
    await page.waitForSelector('.product-price');

    // Collect the trimmed text of every matching element
    return await page.$$eval('.product-price', els =>
      els.map(el => el.textContent.trim())
    );
  } finally {
    // Close the browser even if navigation or the selector wait throws
    await browser.close();
  }
}

The waitUntil: 'networkidle' option resolves once there have been no network connections for at least 500 milliseconds, which usually means the initial round of client-side data fetching is complete. Pages that poll or stream in the background may never reach that state, so treat it as a hint rather than a guarantee. The waitForSelector call is the more dependable check: it ensures the specific element you need is actually present before querying.

The operational downside of running Playwright in-process is resource consumption. Each browser instance typically requires 200 to 400 megabytes of RAM and significant CPU time to render and execute JavaScript. Running more than a few concurrent instances on a standard server becomes expensive quickly, and managing browser process lifecycles — handling crashes, memory leaks, and zombie processes — adds real operational complexity to production deployments.
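One common way to limit that overhead is to launch a single browser, open a short-lived page per URL, and guarantee the page is closed whatever happens. The sketch below is one possible shape for that, not a prescribed Playwright pattern; withPage is a hypothetical helper, and it relies only on browser.newPage() and page.close():

```javascript
// Hypothetical helper: run a task against a fresh page and always close it.
// `browser` is expected to be a Playwright Browser; only newPage() is used.
async function withPage(browser, task) {
  const page = await browser.newPage();
  try {
    return await task(page);
  } finally {
    // Runs on success and on failure, so pages do not accumulate
    await page.close();
  }
}
```

With Playwright this might be called as withPage(browser, page => page.title()) after one chromium.launch(), leaving the browser itself to be closed once at shutdown.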

Delegating Browser Rendering to a Scraping API

For teams that need JavaScript rendering without managing browser infrastructure, a scraping API handles the heavy lifting remotely. You send an HTTP request with the target URL and your requirements, and the API executes the page in a fully managed Chromium instance on its infrastructure, then returns rendered HTML, structured JSON, or a screenshot depending on which endpoint you call.

SnapAPI's scrape endpoint returns the fully rendered HTML of any page after JavaScript execution completes. The extract endpoint goes further: you pass a CSS selector and receive the matched text or attribute values directly, skipping the HTML parsing step entirely:

// Top-level await: run this as an ES module (.mjs) or wrap it in an async function
const response = await fetch('https://api.snapapi.pics/v1/extract', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Api-Key': process.env.SNAPAPI_KEY
  },
  body: JSON.stringify({
    url: 'https://shop.example.com/product/123',
    selector: '.product-price',
    wait_for: '.product-price'
  })
});
if (!response.ok) {
  throw new Error(`Extract request failed with status ${response.status}`);
}
const { text } = await response.json();
console.log('Price:', text); // "$49.99"

This replaces the entire Playwright setup — browser installation, launch time, page lifecycle management, and cleanup — with a single fetch call that works identically in Node.js, Deno, Bun, or any environment that supports the Fetch API. The remote API handles proxy rotation, anti-bot bypass, JavaScript execution, and retry logic transparently.

Handling Anti-Bot Measures in JavaScript Scrapers

Many commercial websites deploy anti-bot systems including Cloudflare Turnstile, DataDome, PerimeterX, and custom fingerprinting scripts. These systems detect headless browsers by examining browser feature flags, TLS fingerprints, JavaScript execution timing, and behavioral signals that differ between real user sessions and automated scripts.

When using Playwright directly, you can add stealth evasions by wrapping it with playwright-extra and loading puppeteer-extra-plugin-stealth. The plugin patches common detection vectors including the navigator.webdriver property, Chrome runtime checks, and WebGL vendor strings. However, stealth plugins require frequent updates as detection vendors continuously evolve their checks, making maintenance an ongoing responsibility.

SnapAPI's stealth mode is maintained at the infrastructure level — updates are invisible to callers. Pass "stealth": true in your extract or scrape request body and the API selects a browser profile optimized to bypass detection on the target site without any plugin management on your end.
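As a rough illustration, the request options could be assembled in one place so the stealth flag is only sent when needed. This is a sketch under assumptions: buildExtractRequest is a hypothetical helper, and the field names (url, selector, wait_for, stealth) and the X-Api-Key header are taken from this article rather than verified against the API documentation:

```javascript
// Hypothetical helper: build fetch() options for an extract request.
// Field names follow the examples in this article and are not verified.
function buildExtractRequest(apiKey, { url, selector, waitFor, stealth = false }) {
  const body = { url, selector };
  if (waitFor) body.wait_for = waitFor; // optional: element to wait for
  if (stealth) body.stealth = true;     // optional: request stealth mode
  return {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'X-Api-Key': apiKey },
    body: JSON.stringify(body)
  };
}
```

The returned object can be passed as the second argument to fetch, for example fetch(endpoint, buildExtractRequest(key, { url, selector: 'h1', stealth: true })).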

Batch Scraping with Concurrency Control in Node.js

When processing lists of URLs, concurrency dramatically reduces total execution time. Node.js handles many concurrent HTTP requests efficiently through its event loop. The key is capping parallelism to avoid overwhelming the target server or exceeding API rate limits:

async function scrapeAll(urls, concurrency = 5) {
  const results = [];
  const queue = [...urls];

  // Each worker pulls the next URL until the shared queue is empty
  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        const res = await fetch('https://api.snapapi.pics/v1/extract', {
          method: 'POST',
          headers: {
            'X-Api-Key': process.env.SNAPAPI_KEY,
            'Content-Type': 'application/json'
          },
          body: JSON.stringify({ url, selector: 'h1' })
        });
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        const data = await res.json();
        results.push({ url, title: data.text });
      } catch (err) {
        // Record the failure so one bad URL does not reject the whole batch
        results.push({ url, error: err.message });
      }
    }
  }

  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}

This worker pool pattern processes the URL queue as fast as the API responds, keeping at most concurrency requests in flight at any time (fewer once the queue starts to drain). Results are collected in completion order, not input order. Tune the concurrency value based on your plan's rate limits: the SnapAPI Starter plan supports up to 5 concurrent requests, Pro supports 20, and Business supports unlimited concurrency.
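The pool logic can also be separated from the SnapAPI call so it can be tested with any async function. The following is a sketch rather than a definitive implementation; mapWithConcurrency is a hypothetical name, and unlike the loop above it writes each result at its input index and captures per-item errors instead of rejecting the batch:

```javascript
// Hypothetical helper: run fn over items with at most `limit` calls in flight.
// Results keep input order; failures are captured per item.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function worker() {
    while (next < items.length) {
      const index = next++; // claim synchronously, before any await
      try {
        results[index] = { ok: true, value: await fn(items[index], index) };
      } catch (error) {
        results[index] = { ok: false, error };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because JavaScript runs each worker on a single thread, claiming an index before the first await is enough to prevent two workers from processing the same item.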

Get started with SnapAPI at snapapi.pics — 200 free extractions per month, no credit card required. The API handles browser rendering, JavaScript execution, stealth mode, and proxy rotation so your Node.js scraper stays lean and focused on your business logic rather than browser infrastructure.

Parsing and Structuring Scraped Data

Once you have raw HTML or extracted text, you typically need to parse it into structured records for storage or analysis. For HTML returned by SnapAPI's scrape endpoint, load it into Cheerio on your server and apply your selectors:

const cheerio = require('cheerio');

async function scrapeProductPage(url) {
  const res = await fetch('https://api.snapapi.pics/v1/scrape', {
    method: 'POST',
    headers: { 'X-Api-Key': process.env.SNAPAPI_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, wait_for: '.product-details' })
  });
  const { html } = await res.json();
  const $ = cheerio.load(html);

  return {
    title: $('h1.product-title').text().trim(),
    price: $('.product-price').first().text().trim(),
    rating: $('[data-rating]').attr('data-rating'),
    images: $('img.product-image').map((_, el) => $(el).attr('src')).get()
  };
}

This pattern separates browser rendering concerns from data extraction logic. SnapAPI handles the Chromium instance and JavaScript execution; your code handles the domain-specific parsing. Testing extraction logic is straightforward — save a sample HTML response to a fixture file and unit test your selectors without making API calls.
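The extracted fields are still strings at this point, so a small normalization step is usually needed before storage or comparison. As one illustration, price text could be converted to integer cents. The helper name parsePrice and the formatting assumptions (a period as the decimal separator, commas as thousands separators) are mine, not the article's:

```javascript
// Hypothetical helper: turn scraped price text into a structured value.
// Assumes "." is the decimal separator and "," a thousands separator.
function parsePrice(text) {
  const match = /(\d[\d,]*)(?:\.(\d{1,2}))?/.exec(text);
  if (!match) return null; // no digits found, e.g. "Sold out"
  const whole = Number(match[1].replace(/,/g, ''));
  const fraction = (match[2] || '').padEnd(2, '0'); // "5" -> "50", "" -> "00"
  const currency = text.slice(0, match.index).trim() || null;
  // Integer cents avoid floating-point drift when prices are compared later
  return { cents: whole * 100 + Number(fraction), currency };
}
```

Sites that format prices differently (for example "49,99 €") would need a different parser, which is a good reason to keep this step in your own code and cover it with fixture-based tests.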

Storing Results in a Database

For scraping pipelines that run on a schedule, store results in PostgreSQL or SQLite and track changes over time. A simple schema with url, scraped_at, and a JSON data column (JSONB in PostgreSQL, JSON stored as text in SQLite) handles most use cases while preserving the full history of each scrape run. Compare the latest scrape against the previous record to detect price changes, content updates, or layout modifications that indicate the target site has changed its HTML structure.
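The comparison step can be a plain function over two stored records, independent of the database. A minimal sketch, assuming each record is a flat JSON-serializable object like the one returned by scrapeProductPage; diffRecords is a hypothetical name:

```javascript
// Hypothetical helper: report which fields differ between two scrape records.
// Values are compared by their JSON form, so arrays are compared by content.
// Nested objects with reordered keys would be reported as changed.
function diffRecords(previous, latest) {
  const changes = {};
  const keys = new Set([...Object.keys(previous), ...Object.keys(latest)]);
  for (const key of keys) {
    if (JSON.stringify(previous[key]) !== JSON.stringify(latest[key])) {
      changes[key] = { from: previous[key], to: latest[key] };
    }
  }
  return changes;
}
```

An empty result means nothing changed. A field that switches from a value to an empty string or undefined across many URLs on the same site is a hint that a selector has stopped matching.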

Pair the scrape endpoint with the screenshot endpoint to build a visual archive alongside your structured data. When a price change is detected, capture a screenshot for visual confirmation — useful for audit logs and for debugging selector failures when the site redesigns its layout.

Error Handling and Retry Logic for Web Scrapers

Production scrapers encounter errors constantly: target sites go down, return unexpected status codes, change their structure without notice, or block requests intermittently. Robust error handling is the difference between a scraper that runs reliably for months and one that silently stops producing data.

Implement exponential backoff for transient failures — network timeouts, 503 errors, and rate limit responses. Log both the error and the URL that triggered it so you can diagnose selector failures separately from infrastructure problems. Track success rates per domain and alert when a target site's success rate drops below a threshold, which typically indicates a layout change that broke your selectors.
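Per-domain tracking needs very little code. The sketch below keeps counters in memory and flags domains whose success rate falls below a threshold. The class name DomainStats, the 0.8 threshold, and the 20-sample minimum are illustrative assumptions and should be tuned to your own workload:

```javascript
// Hypothetical in-memory tracker for per-domain success rates.
class DomainStats {
  constructor(threshold = 0.8, minSamples = 20) {
    this.threshold = threshold;   // alert below this success ratio
    this.minSamples = minSamples; // ignore domains with too little data
    this.stats = new Map();       // hostname -> { ok, total }
  }

  record(url, succeeded) {
    const domain = new URL(url).hostname;
    const entry = this.stats.get(domain) || { ok: 0, total: 0 };
    entry.total += 1;
    if (succeeded) entry.ok += 1;
    this.stats.set(domain, entry);
  }

  // Domains with enough samples whose success rate is under the threshold
  failingDomains() {
    const failing = [];
    for (const [domain, { ok, total }] of this.stats) {
      if (total >= this.minSamples && ok / total < this.threshold) {
        failing.push(domain);
      }
    }
    return failing;
  }
}
```

A scheduled job could call failingDomains() after each run and send the result to whatever alerting channel is already in place.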

SnapAPI returns standard HTTP status codes that map cleanly to retry logic: 200 for success, 400 for invalid parameters, 401 for an incorrect API key, 429 for rate limiting with a Retry-After header, and 5xx for transient infrastructure errors. Only 429 and 5xx responses should trigger retries — 400 and 401 indicate a problem with your request that retrying will not fix.
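That policy can be expressed as a small wrapper around fetch. This is a sketch under assumptions, not SnapAPI's official client: fetchWithRetry, its option names, and the default delays are hypothetical, and jitter is left out for brevity. It retries network errors, 429, and 5xx responses, honors a numeric Retry-After header when present, and otherwise doubles the delay on each attempt:

```javascript
// Hypothetical helper. `fetchImpl` and `sleep` are injectable so the retry
// policy can be tested without real network calls or real delays.
async function fetchWithRetry(url, init, {
  retries = 4,
  baseDelayMs = 500,
  maxDelayMs = 30000,
  fetchImpl = fetch,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
} = {}) {
  for (let attempt = 0; ; attempt++) {
    let response = null;
    let failure = null;
    try {
      response = await fetchImpl(url, init);
    } catch (error) {
      failure = error; // network-level failure (DNS, reset, timeout)
    }

    const retryable =
      failure !== null || response.status === 429 || response.status >= 500;
    if (!retryable) return response; // 2xx, 400, 401, ...: caller decides

    if (attempt >= retries) {
      if (failure) throw failure;
      return response; // out of attempts: hand back the last response
    }

    // Prefer the server's Retry-After (in seconds); otherwise back off
    // exponentially: base, 2x base, 4x base, ... capped at maxDelayMs.
    const retryAfter = response ? Number(response.headers.get('retry-after')) : NaN;
    const delay = Number.isFinite(retryAfter) && retryAfter > 0
      ? retryAfter * 1000
      : Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
    await sleep(delay);
  }
}
```

A Retry-After value given as an HTTP date rather than a number of seconds falls through to the exponential schedule in this sketch.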

Start building your JavaScript scraper with SnapAPI at snapapi.pics. 200 free requests per month, no credit card needed, and the API is live in production with comprehensive documentation at snapapi.pics/docs.html.