Guide · March 3, 2026 · 13 min read

Extracting Structured Data from Websites: A Developer's Guide

The web is the largest database ever created, but most of it is locked behind HTML markup. Extracting useful, structured data from web pages is a fundamental task for AI pipelines, competitive intelligence, market research, content aggregation, and dozens of other applications. The challenge is doing it reliably, at scale, and without building a fragile scraping infrastructure.

This guide covers modern approaches to web data extraction, with practical examples using SnapAPI's /v1/scrape and /v1/extract endpoints.

Two Approaches to Web Data Extraction

There are two fundamentally different ways to get data from a webpage:

1. Structured Scraping (CSS/XPath Selectors)

You specify exactly which elements you want using CSS selectors or XPath expressions. The API returns structured JSON with the data from those elements. This approach is precise and fast, but requires you to know the page structure in advance.

2. Content Extraction (URL to Markdown/Text)

You point the API at a URL, and it extracts the main content as clean markdown or plain text. No selectors needed. This is ideal for AI/LLM pipelines where you want to feed web content into a language model without HTML noise.

SnapAPI supports both approaches. Let's look at each in detail.

Structured Scraping with /v1/scrape

The /v1/scrape endpoint renders a page in a real browser (handling JavaScript, SPAs, and dynamic content) and then extracts data using CSS selectors you define.

Basic Example: Extracting Page Metadata

curl "https://api.snapapi.pics/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "selectors": {
      "title": "title",
      "top_stories": ".titleline > a",
      "scores": ".score"
    }
  }'

The response is clean JSON:

{
  "success": true,
  "data": {
    "title": "Hacker News",
    "top_stories": [
      "Show HN: I built a screenshot API that does 5 things",
      "PostgreSQL 17 is now available",
      "The state of WebAssembly in 2026"
    ],
    "scores": [
      "342 points",
      "287 points",
      "201 points"
    ]
  },
  "metadata": {
    "url": "https://news.ycombinator.com",
    "statusCode": 200,
    "loadTime": 1243
  }
}
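If you are not using an SDK, the same call works from Python's standard library. This is a minimal sketch mirroring the curl example above (the payload shape comes from that request; `YOUR_API_KEY` is a placeholder):

```python
import json
import urllib.request

API_URL = "https://api.snapapi.pics/v1/scrape"

def build_scrape_request(url, selectors, api_key):
    """Build the POST request for /v1/scrape, mirroring the curl example."""
    payload = json.dumps({"url": url, "selectors": selectors}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def scrape(url, selectors, api_key):
    """Send the request and return the parsed JSON response."""
    req = build_scrape_request(url, selectors, api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending keeps the payload logic easy to test without hitting the network.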

Advanced Selectors: Extracting Attributes and Nested Data

You are not limited to text content. Extract attributes, HTML, and nested structures:

from snapapi import SnapAPI

client = SnapAPI("YOUR_API_KEY")

result = client.scrape(
    url="https://example-ecommerce.com/products",
    selectors={
        # Extract text content (default)
        "product_names": ".product-card h3",

        # Extract specific attributes with @attr syntax
        "product_images": ".product-card img@src",
        "product_links": ".product-card a@href",

        # Extract prices
        "prices": ".product-card .price",

        # Extract data attributes
        "product_ids": ".product-card@data-product-id",
    }
)

# Build structured product data by pairing up the parallel lists
products = []
for name, price, image, link, pid in zip(
    result["data"]["product_names"],
    result["data"]["prices"],
    result["data"]["product_images"],
    result["data"]["product_links"],
    result["data"]["product_ids"],
):
    products.append({
        "name": name,
        "price": price,
        "image": image,
        "link": link,
        "id": pid,
    })

print(f"Found {len(products)} products")

Handling Dynamic Content

Many modern websites load content dynamically via JavaScript. SnapAPI renders pages in a real browser, so JavaScript-rendered content is available. You can also wait for specific elements to appear:

# Wait for dynamic content to load
result = client.scrape(
    url="https://spa-app.com/dashboard",
    selectors={
        "metrics": ".metric-value",
        "chart_labels": ".chart-label",
    },
    wait_for_selector=".metric-value",  # Wait until this element exists
    delay=2000,  # Additional 2-second wait after element appears
)

Pagination: Scraping Multiple Pages

all_products = []

for page_num in range(1, 11):  # Scrape 10 pages
    url = f"https://example-shop.com/products?page={page_num}"

    result = client.scrape(
        url=url,
        selectors={
            "names": ".product-name",
            "prices": ".product-price",
            "ratings": ".product-rating@data-score",
        }
    )

    page_products = result["data"]["names"]
    if not page_products:
        break  # No more results

    for i in range(len(page_products)):
        all_products.append({
            "name": result["data"]["names"][i],
            "price": result["data"]["prices"][i],
            "rating": result["data"]["ratings"][i] if i < len(result["data"]["ratings"]) else None,
        })

    print(f"Page {page_num}: found {len(page_products)} products")

print(f"\nTotal: {len(all_products)} products scraped")

Content Extraction with /v1/extract

The /v1/extract endpoint is designed for a different use case: converting an entire webpage into clean, readable text or markdown. This is the endpoint you want when building AI/LLM data pipelines.

Why Content Extraction Matters for AI

Large language models work with text, not HTML. If you feed raw HTML into a language model, most of the token budget is wasted on tags, attributes, scripts, and styling. Content extraction strips all of that away, giving you just the meaningful content in a format the model can work with efficiently.
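To see the difference concretely, here is a rough, self-contained illustration using only the standard library: stripping tags, scripts, and styles from even a tiny HTML snippet shrinks it substantially. (Real extraction does much more, such as dropping navigation and boilerplate; this sketch only demonstrates the size gap.)

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only text nodes, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

html = """
<html><head><style>.hero { color: red; }</style></head>
<body><nav class="menu"><a href="/">Home</a></nav>
<article><h1>Why Extraction Matters</h1>
<p>Models read text, not markup.</p></article>
<script>trackPageView();</script></body></html>
"""

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)

print(f"Raw HTML: {len(html)} chars")
print(f"Extracted text: {len(text)} chars")
```

On real pages, where markup, scripts, and styling dominate, the ratio is far more dramatic than in this toy snippet.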

Basic Content Extraction

curl "https://api.snapapi.pics/v1/extract?url=https://example.com/blog-post&format=markdown" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response:

{
  "success": true,
  "content": "# Blog Post Title\n\nThis is the main content of the blog post, extracted as clean markdown...\n\n## Section Heading\n\nMore content here with **bold** and *italic* formatting preserved.\n\n- List item 1\n- List item 2\n- List item 3\n",
  "metadata": {
    "title": "Blog Post Title",
    "description": "Meta description of the page",
    "author": "Author Name",
    "publishedDate": "2026-03-01",
    "wordCount": 1847,
    "url": "https://example.com/blog-post"
  }
}

Python Example: Feeding Web Content to an LLM

from snapapi import SnapAPI
from openai import OpenAI

snap = SnapAPI("YOUR_SNAPAPI_KEY")
openai = OpenAI(api_key="YOUR_OPENAI_KEY")

# Step 1: Extract content from a webpage
extracted = snap.extract(
    url="https://techcrunch.com/2026/03/01/some-article",
    format="markdown"
)

# Step 2: Send to LLM for analysis
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a tech news analyst. Summarize the following article and identify key takeaways."
        },
        {
            "role": "user",
            "content": f"Analyze this article:\n\n{extracted['content']}"
        }
    ]
)

print(response.choices[0].message.content)

Building a RAG Pipeline

Retrieval-Augmented Generation (RAG) systems need clean text to build their knowledge base. Here is how to use SnapAPI extract with a vector database:

from snapapi import SnapAPI
import chromadb
from sentence_transformers import SentenceTransformer

snap = SnapAPI("YOUR_API_KEY")
model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.Client()
collection = chroma.create_collection("web_knowledge")

urls = [
    "https://docs.python.org/3/tutorial/index.html",
    "https://fastapi.tiangolo.com/tutorial/",
    "https://docs.docker.com/get-started/",
]

for url in urls:
    # Extract clean content
    result = snap.extract(url=url, format="markdown")
    content = result["content"]

    # Chunk the content (simple approach: split by paragraphs)
    chunks = [c.strip() for c in content.split("\n\n") if len(c.strip()) > 50]

    # Generate embeddings and store
    for i, chunk in enumerate(chunks):
        embedding = model.encode(chunk).tolist()
        collection.add(
            documents=[chunk],
            embeddings=[embedding],
            ids=[f"{url}_{i}"],
            metadatas=[{"url": url, "chunk_index": i}]
        )

    print(f"Indexed {len(chunks)} chunks from {url}")

# Now query the knowledge base
query = "How do I create a Docker container?"
query_embedding = model.encode(query).tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=3)

for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"\nFrom {meta['url']}:")
    print(doc[:200] + "...")
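The retrieval step above returns the most relevant chunks; the generation half of RAG assembles them into the model prompt. A minimal sketch of that assembly step (the prompt wording is illustrative, not a SnapAPI feature):

```python
def build_rag_prompt(question, chunks_with_sources):
    """Assemble retrieved (document, metadata) pairs into a grounded prompt."""
    context_blocks = []
    for doc, meta in chunks_with_sources:
        context_blocks.append(f"Source: {meta['url']}\n{doc}")
    context = "\n\n---\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the source URL for each claim.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example with a hand-written retrieved chunk; in the pipeline above you
# would zip results["documents"][0] with results["metadatas"][0] instead.
sample = [
    ("Use `docker run` to start a container.",
     {"url": "https://docs.docker.com/get-started/"}),
]
prompt = build_rag_prompt("How do I create a Docker container?", sample)
print(prompt)
```

Keeping the source URL next to each chunk lets the model cite where an answer came from, which makes hallucinations easier to spot.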

Scrape vs. Extract: When to Use Which

Use Case                         Best Endpoint   Why
Product data from e-commerce     /v1/scrape      You know the exact selectors for price, name, image
Blog content for LLM analysis    /v1/extract     You want clean markdown, not specific elements
Competitor pricing tables        /v1/scrape      Structured data from known page layouts
News article summarization       /v1/extract     Main content extraction without HTML noise
Job listings from career pages   /v1/scrape      Repeating elements with consistent selectors
Documentation for RAG            /v1/extract     Full page content in embedding-ready format
Social media profiles            /v1/scrape      Specific data points (followers, bio, posts)
Research paper analysis          /v1/extract     Full text content for LLM processing

Handling Common Challenges

Anti-Bot Protection

Many websites use bot detection services (Cloudflare, DataDome, PerimeterX). SnapAPI uses stealth browser profiles that mimic real user behavior, making your scraping requests harder to detect and block.

Rate Limiting and Politeness

Responsible scraping means respecting the target website's resources. Here is a pattern for polite, rate-limited scraping:

import time
import random

def scrape_politely(client, urls, selectors, min_delay=1.0, max_delay=3.0):
    """Scrape multiple URLs with random delays to be polite."""
    results = []

    for url in urls:
        try:
            data = client.scrape(url=url, selectors=selectors)
            results.append({"url": url, "data": data["data"], "status": "ok"})
        except Exception as e:
            results.append({"url": url, "error": str(e), "status": "error"})

        # Random delay between requests
        delay = random.uniform(min_delay, max_delay)
        time.sleep(delay)

    return results
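Transient failures (timeouts, rate-limit responses) are often worth retrying rather than recording as permanent errors. A simple exponential-backoff wrapper, sketched here around any callable (the delay constants are illustrative, and `sleep` is injectable so tests can skip the real waits):

```python
import time

def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Back off: base_delay, 2x, 4x, ... between attempts
            sleep(base_delay * (2 ** (attempt - 1)))
```

Inside scrape_politely, you could wrap the client.scrape call as `with_retries(lambda: client.scrape(url=url, selectors=selectors))` so a single transient failure doesn't lose a page.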

Handling JavaScript-Heavy SPAs

Single-page applications often load data via API calls after the initial page render. SnapAPI handles this automatically since it uses a real browser, but you may need to add wait conditions:

# For React/Vue/Angular apps
result = client.scrape(
    url="https://react-app.com/products",
    selectors={"items": ".product-item"},
    wait_for_selector=".product-item",
    delay=2000,  # Extra wait for lazy-loaded content
    block_ads=True
)

Building a Complete Data Pipeline

Here is a production-ready pattern that combines scraping with data storage and monitoring:

import json
import time
from datetime import datetime
from pathlib import Path
from snapapi import SnapAPI

client = SnapAPI("YOUR_API_KEY")

def run_competitor_monitor():
    """Daily competitor pricing monitor."""

    competitors = [
        {
            "name": "competitor_a",
            "url": "https://competitor-a.com/pricing",
            "selectors": {
                "plan_names": ".plan-name",
                "plan_prices": ".plan-price",
                "plan_features": ".plan-features li",
            }
        },
        {
            "name": "competitor_b",
            "url": "https://competitor-b.com/pricing",
            "selectors": {
                "plan_names": ".tier-title",
                "plan_prices": ".tier-price",
                "plan_features": ".tier-feature",
            }
        },
    ]

    timestamp = datetime.now().isoformat()
    results = {"timestamp": timestamp, "data": {}}

    for comp in competitors:
        try:
            scraped = client.scrape(
                url=comp["url"],
                selectors=comp["selectors"],
                block_ads=True,
                block_cookie_banners=True
            )
            results["data"][comp["name"]] = scraped["data"]
            print(f"  Scraped {comp['name']}: OK")
        except Exception as e:
            results["data"][comp["name"]] = {"error": str(e)}
            print(f"  Scraped {comp['name']}: FAILED - {e}")

        time.sleep(2)

    # Save results
    output_dir = Path("pricing_data")
    output_dir.mkdir(exist_ok=True)
    date_str = datetime.now().strftime("%Y-%m-%d")

    with open(output_dir / f"{date_str}.json", "w") as f:
        json.dump(results, f, indent=2)

    return results

# Run the monitor
data = run_competitor_monitor()
print(json.dumps(data, indent=2))
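The saved daily snapshots make the monitoring half straightforward: load today's file and yesterday's, then flag any plan whose price moved. A minimal diff over the JSON structure produced above (field names match run_competitor_monitor's output; the comparison logic itself is a sketch):

```python
def diff_prices(old, new):
    """Compare two daily snapshots and report changed plan prices."""
    changes = []
    for comp, new_data in new["data"].items():
        old_data = old["data"].get(comp, {})
        if "error" in new_data or "error" in old_data:
            continue  # skip competitors whose scrape failed on either day
        # Map plan name -> price for the older snapshot
        old_prices = dict(zip(old_data.get("plan_names", []),
                              old_data.get("plan_prices", [])))
        for name, price in zip(new_data.get("plan_names", []),
                               new_data.get("plan_prices", [])):
            if name in old_prices and old_prices[name] != price:
                changes.append({
                    "competitor": comp,
                    "plan": name,
                    "old": old_prices[name],
                    "new": price,
                })
    return changes
```

From here, a non-empty changes list is a natural trigger for a Slack message or email alert.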

Conclusion

Web data extraction has evolved beyond simple HTML parsing. Modern APIs like SnapAPI handle the hard parts -- JavaScript rendering, bot detection, dynamic content, and browser management -- so you can focus on what to do with the data rather than how to get it.

Key takeaways:

- Use /v1/scrape with CSS selectors when you know the page structure and need precise, structured JSON.
- Use /v1/extract when you need clean markdown or plain text for LLM and RAG pipelines.
- Add wait conditions and delays when scraping JavaScript-heavy SPAs.
- Rate-limit your requests and add random delays to scrape politely.

Start extracting web data today

200 free requests per month. Scrape, extract, and screenshot with a single API.

Get Your Free API Key