Web scraping in JavaScript has evolved dramatically. A few years ago, cheerio and a basic HTTP request were enough to extract data from most sites. Today, many data-rich websites are JavaScript-rendered single-page applications that serve near-empty HTML shells and populate content via asynchronous API calls. This guide covers three approaches: simple static scraping with cheerio, full browser automation with Playwright, and cloud-based scraping APIs that remove the infrastructure burden.

When to Use Each Scraping Method

Choosing the right scraping tool depends on the target site and how much infrastructure you want to maintain. The table below summarizes a common way to decide:

| Site Type | Best Tool | Notes |
| --- | --- | --- |
| Static HTML | cheerio + axios | Fastest, no browser overhead, works in serverless |
| Light JS rendering | Puppeteer or Playwright | Requires managed browser, memory-intensive |
| Anti-bot protection | Cloud scraping API | Handles CAPTCHA bypass, rotating IPs, stealth mode |
| Scale: 1000+ URLs/day | Cloud scraping API | Eliminates browser fleet maintenance entirely |
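
The table above can also be read as a small decision function. The sketch below is one possible encoding, not a fixed rule: the function name and the input fields (needsJsRendering, hasAntiBot, urlsPerDay) are illustrative assumptions, and the 1000 URLs/day threshold is taken directly from the table.

```javascript
// Hypothetical helper that encodes the decision table above.
// Field names are illustrative; adjust the threshold to your own workload.
function chooseScrapingTool({ needsJsRendering = false, hasAntiBot = false, urlsPerDay = 0 } = {}) {
  // Anti-bot protection or high volume: the table points to a cloud scraping API.
  if (hasAntiBot || urlsPerDay >= 1000) {
    return 'cloud scraping API';
  }
  // Client-side rendering without heavy protection: a headless browser.
  if (needsJsRendering) {
    return 'Puppeteer or Playwright';
  }
  // Everything else is treated as static HTML.
  return 'cheerio + axios';
}
```

Checking anti-bot protection and volume first mirrors the table: both rows recommend the same tool regardless of how the page is rendered.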

Static Scraping with Cheerio and Axios

Cheerio implements a subset of the core jQuery API on the server: it parses HTML and lets you traverse the resulting document with familiar CSS selectors. Combined with axios for HTTP requests, it is among the fastest ways to scrape static HTML pages in Node.js. There is no Chromium process and no browser memory overhead, and it runs on any Node.js release that cheerio and axios currently support.

const axios = require('axios');
const cheerio = require('cheerio');

async function scrapeHackerNews() {
  const { data } = await axios.get('https://news.ycombinator.com');
  const $ = cheerio.load(data);

  const stories = [];
  $('.athing').each((i, el) => {
    const title = $(el).find('.titleline a').first().text();
    const url = $(el).find('.titleline a').first().attr('href');
    const points = $(el).next().find('.score').text();
    stories.push({ title, url, points });
  });

  return stories;
}

scrapeHackerNews().then(s => console.log(JSON.stringify(s, null, 2)));

Cheerio works well for sites that return full HTML on the initial request. The limitation is that it cannot execute JavaScript, so React, Vue, and Angular applications that render content client-side will return empty containers.
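
A quick way to tell whether a page falls into this category is to check how much visible text the raw HTML contains before reaching for a browser. The following is a rough heuristic sketch that uses only string operations; the function name and the 200-character threshold are assumptions, and some sites will need a different cutoff.

```javascript
// Rough heuristic: does the HTML contain meaningful visible text,
// or is it mostly an empty shell waiting for JavaScript?
// The 200-character default is an arbitrary assumption; tune it per site.
function looksClientRendered(html, minTextLength = 200) {
  // Prefer the <body> contents; fall back to the whole document.
  const bodyMatch = html.match(/<body[^>]*>([\s\S]*?)<\/body>/i);
  const body = bodyMatch ? bodyMatch[1] : html;

  const visibleText = body
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, ' ') // drop non-visible blocks
    .replace(/<[^>]+>/g, ' ')  // drop remaining tags
    .replace(/\s+/g, ' ')      // collapse whitespace
    .trim();

  return visibleText.length < minTextLength;
}
```

If the function returns true for the HTML that axios fetched, the cheerio selectors are likely to come back empty and a rendering approach is the safer choice.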

Browser Automation with Playwright

Playwright is Microsoft's browser automation library, supporting Chromium, Firefox, and WebKit in headless or headed mode. It handles JavaScript rendering, can wait for network activity to settle, and gives you full control over the browser via an async API. The trade-off is resource intensity: each browser instance consumes roughly 200MB of RAM at baseline, more on heavy pages, and requires a downloaded browser binary (installed with npx playwright install).

const { chromium } = require('playwright');

async function scrapeWithPlaywright(url) {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();

    await page.goto(url, { waitUntil: 'networkidle' });

    // Wait for dynamic content
    await page.waitForSelector('.product-price');

    // Collect the trimmed text of every matching element
    return await page.$$eval('.product-price', els =>
      els.map(el => el.textContent.trim())
    );
  } finally {
    // Always close the browser, even on errors, to avoid orphaned processes
    await browser.close();
  }
}

The Infrastructure Problem with Self-Hosted Browsers

Running Playwright at scale requires solving several hard infrastructure problems. You need to provision servers with sufficient RAM (plan for 300-500MB per concurrent browser), manage browser process lifecycle to prevent zombie processes, handle OS-level dependencies (libglib, libatk, and dozens of other shared libraries), and deal with websites that detect and block datacenter IP ranges.
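
The RAM figure translates into a simple capacity estimate. The sketch below is back-of-the-envelope arithmetic only; the function name and the reservedMb default (memory held back for the operating system and the Node.js process) are assumptions, not measured values.

```javascript
// Back-of-the-envelope capacity estimate using the 300-500MB range above.
// reservedMb is an assumed allowance for the OS and the Node.js process.
function estimateConcurrentBrowsers(totalRamMb, { perBrowserMb = 500, reservedMb = 1024 } = {}) {
  const usableMb = Math.max(0, totalRamMb - reservedMb);
  return Math.floor(usableMb / perBrowserMb);
}
```

With these defaults, an 8GB (8192MB) server fits 14 concurrent browsers at 500MB each, or 23 at the optimistic 300MB figure.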

For teams running more than a few dozen scrapes per hour, the operational cost of maintaining a Playwright cluster often exceeds the cost of a cloud scraping API subscription.

Cloud-Based Scraping with SnapAPI

SnapAPI provides a managed browser infrastructure via REST API. Pass a URL and extraction selectors, get back structured JSON. No Chromium install, no server provisioning, no proxy management.

const axios = require('axios');

async function scrapeWithSnapAPI(url, selectors) {
  const response = await axios.post(
    'https://api.snapapi.pics/v1/extract',
    {
      url,
      schema: selectors,
      block_ads: true,
      stealth: true,
      wait_for: '.product-price',
    },
    { headers: { 'X-Api-Key': process.env.SNAP_API_KEY } }
  );
  return response.data;
}

// Example: extract product data
// (wrapped in an async function because top-level await is not available in CommonJS)
async function main() {
  const data = await scrapeWithSnapAPI('https://shop.example.com/products/widget', {
    title: 'h1.product-title',
    price: '.product-price',
    description: '.product-description',
    images: { selector: '.product-images img', attribute: 'src', multiple: true },
  });
  console.log(data);
  // { title: "Widget Pro", price: "$49.99", images: ["..."] }
}

main().catch(console.error);

Handling Anti-Bot Measures in 2026

Modern anti-bot systems like Cloudflare Turnstile, Akamai Bot Manager, and PerimeterX detect browser automation through dozens of signals: TLS fingerprint, browser canvas rendering, mouse movement patterns, and JavaScript runtime quirks. Bypassing these systems manually requires constantly updating stealth patches as detection algorithms evolve.

SnapAPI's stealth mode handles this automatically. The managed Chromium instance passes common bot detection checks out of the box, and the infrastructure routes requests through residential IP pools when needed. This means you can scrape sites that would immediately block a standard Playwright setup without any additional configuration.

Rate Limiting and Polite Scraping

Responsible scraping means respecting rate limits and not overwhelming the target server. Even when using a cloud API, implement delays between requests to the same domain. A safe default is one request per second per domain, or respecting the Crawl-delay directive in robots.txt if present. Use jitter (randomised delay) to avoid creating obvious periodic request patterns that trigger rate limit detection.

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));
const jitter = () => Math.floor(Math.random() * 1000);

// schema is the selector map passed through to scrapeWithSnapAPI
async function batchScrape(urls, schema) {
  const results = [];
  for (const url of urls) {
    results.push(await scrapeWithSnapAPI(url, schema));
    await delay(1000 + jitter()); // 1-2s between requests
  }
  return results;
}
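
To honour a Crawl-delay directive, the value first has to be read out of robots.txt. The following is a simplified sketch rather than a complete robots.txt parser: it only tracks User-agent groups and Crawl-delay lines, and the function name and return convention (milliseconds, or null when no directive applies) are assumptions.

```javascript
// Minimal Crawl-delay lookup. Simplified on purpose: it ignores Allow,
// Disallow and Sitemap rules and only tracks User-agent groups.
function parseCrawlDelay(robotsTxt, userAgent = '*') {
  const wanted = userAgent.toLowerCase();
  let agents = [];      // user-agents of the group currently being read
  let inRules = false;  // true once the current group has listed a rule
  let specific = null;  // delay from a group naming the wanted user-agent
  let wildcard = null;  // delay from the "*" group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      if (inRules) { agents = []; inRules = false; } // a new group begins
      agents.push(value.toLowerCase());
      continue;
    }

    inRules = true;
    if (field !== 'crawl-delay') continue;
    const seconds = Number.parseFloat(value);
    if (Number.isNaN(seconds)) continue;
    if (wanted !== '*' && agents.includes(wanted)) specific = seconds;
    if (agents.includes('*')) wildcard = seconds;
  }

  const seconds = specific ?? wildcard; // prefer the exact match
  return seconds === null ? null : seconds * 1000;
}
```

The result can replace the fixed base delay in the loop, for example `await delay((parseCrawlDelay(robotsTxt) ?? 1000) + jitter())`, where robotsTxt is the text of the site's robots.txt fetched beforehand.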

Storing and Processing Scraped Data

Once you have raw scraped data, structure it and store it efficiently. For price monitoring and time-series data, PostgreSQL with a timestamptz column is a reliable choice. For document-heavy scraping like news articles or job listings, consider a document store like MongoDB or a search engine like Meilisearch that lets you query and rank by content fields.

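Normalise currency, dates, and numeric fields before storage. Scraped price strings like "$1,299.99" or "EUR 49,90" need consistent parsing before you can run queries across them. Note that the JavaScript Intl.NumberFormat API formats numbers for display but does not parse strings, so parsing needs custom logic or a dedicated library; its formatToParts method can at least reveal which group and decimal separators a locale uses. Write unit tests for the edge cases that will inevitably appear in production data.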
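
One way to handle strings such as "$1,299.99" and "EUR 49,90" is a small heuristic parser. The sketch below rests on a stated assumption: a final separator followed by one or two digits is treated as the decimal mark, and every other separator as a thousands group. The function name and the short currency symbol map are illustrative, and inputs such as three-decimal currencies will need extra handling.

```javascript
// Heuristic price parser. Assumption: a trailing separator followed by
// one or two digits is a decimal mark; any other separator groups thousands.
// Currency detection covers only a few illustrative symbols and ISO-style codes.
function parsePrice(raw) {
  const symbols = { '$': 'USD', '€': 'EUR', '£': 'GBP' };
  const currencyMatch = raw.match(/[$€£]|\b[A-Z]{3}\b/);
  const currency = currencyMatch ? (symbols[currencyMatch[0]] || currencyMatch[0]) : null;

  // Keep digits and separators only.
  const numeric = raw.replace(/[^\d.,]/g, '');
  if (!/\d/.test(numeric)) return null; // nothing that looks like a number

  const lastSep = Math.max(numeric.lastIndexOf(','), numeric.lastIndexOf('.'));
  const trailingDigits = lastSep === -1 ? 0 : numeric.length - lastSep - 1;
  const hasDecimalPart = trailingDigits === 1 || trailingDigits === 2;

  const integerPart = (hasDecimalPart ? numeric.slice(0, lastSep) : numeric).replace(/[.,]/g, '');
  const fractionPart = hasDecimalPart ? numeric.slice(lastSep + 1) : '';
  const amount = Number(`${integerPart || '0'}.${fractionPart || '0'}`);

  return { amount, currency };
}
```

Under that assumption, "$1,299.99" parses to an amount of 1299.99 in USD and "EUR 49,90" to 49.9 in EUR, while a string with no digits returns null. These cases make a natural starting point for the unit tests.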

Getting Started with SnapAPI Scraping

Create a free SnapAPI account to get 200 monthly extractions included. No credit card required. The free tier is enough to evaluate the API against your target sites and build your extraction schema before committing to a paid plan. The Starter plan at $19 per month includes 5,000 monthly requests, sufficient for most small-scale monitoring workflows.
