Extracting Structured Data from Websites: A Developer's Guide
The web is the largest database ever created, but most of it is locked behind HTML markup. Extracting useful, structured data from web pages is a fundamental task for AI pipelines, competitive intelligence, market research, content aggregation, and dozens of other applications. The challenge is doing it reliably, at scale, and without building a fragile scraping infrastructure.
This guide covers modern approaches to web data extraction, with practical examples using SnapAPI's /v1/scrape and /v1/extract endpoints.
Two Approaches to Web Data Extraction
There are two fundamentally different ways to get data from a webpage:
1. Structured Scraping (CSS/XPath Selectors)
You specify exactly which elements you want using CSS selectors or XPath expressions. The API returns structured JSON with the data from those elements. This approach is precise and fast, but requires you to know the page structure in advance.
2. Content Extraction (URL to Markdown/Text)
You point the API at a URL, and it extracts the main content as clean markdown or plain text. No selectors needed. This is ideal for AI/LLM pipelines where you want to feed web content into a language model without HTML noise.
SnapAPI supports both approaches. Let's look at each in detail.
Structured Scraping with /v1/scrape
The /v1/scrape endpoint renders a page in a real browser (handling JavaScript, SPAs, and dynamic content) and then extracts data using CSS selectors you define.
Basic Example: Extracting Page Metadata
curl "https://api.snapapi.pics/v1/scrape" -H "Authorization: Bearer YOUR_API_KEY" -H "Content-Type: application/json" -d '{
"url": "https://news.ycombinator.com",
"selectors": {
"title": "title",
"top_stories": ".titleline > a",
"scores": ".score"
}
}'
The response is clean JSON:
{
"success": true,
"data": {
"title": "Hacker News",
"top_stories": [
"Show HN: I built a screenshot API that does 5 things",
"PostgreSQL 17 is now available",
"The state of WebAssembly in 2026"
],
"scores": [
"342 points",
"287 points",
"201 points"
]
},
"metadata": {
"url": "https://news.ycombinator.com",
"statusCode": 200,
"loadTime": 1243
}
}
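Selector results come back as parallel, index-aligned arrays, so pairing them up is a one-liner on the client side. A quick sketch using the sample response above:

```python
# Sample data in the shape of the /v1/scrape response above
data = {
    "top_stories": [
        "Show HN: I built a screenshot API that does 5 things",
        "PostgreSQL 17 is now available",
        "The state of WebAssembly in 2026",
    ],
    "scores": ["342 points", "287 points", "201 points"],
}

# Zip the parallel arrays into one record per story
stories = [
    {"title": title, "points": int(score.split()[0])}
    for title, score in zip(data["top_stories"], data["scores"])
]
print(stories[0]["points"])  # 342
```

One caveat: the arrays stay aligned only if every item on the page has both elements. If some stories can lack a score, scope your selectors to a per-item container instead of relying on index alignment.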
Advanced Selectors: Extracting Attributes and Nested Data
You are not limited to text content. Extract attributes, HTML, and nested structures:
from snapapi import SnapAPI
client = SnapAPI("YOUR_API_KEY")
result = client.scrape(
url="https://example-ecommerce.com/products",
selectors={
# Extract text content (default)
"product_names": ".product-card h3",
# Extract specific attributes with @attr syntax
"product_images": ".product-card img@src",
"product_links": ".product-card a@href",
# Extract prices
"prices": ".product-card .price",
# Extract data attributes
"product_ids": ".product-card@data-product-id",
}
)
# Build structured product data by zipping the parallel result arrays
products = [
    {"name": name, "price": price, "image": image, "link": link, "id": pid}
    for name, price, image, link, pid in zip(
        result["data"]["product_names"],
        result["data"]["prices"],
        result["data"]["product_images"],
        result["data"]["product_links"],
        result["data"]["product_ids"],
    )
]
print(f"Found {len(products)} products")
Handling Dynamic Content
Many modern websites load content dynamically via JavaScript. SnapAPI renders pages in a real browser, so JavaScript-rendered content is available. You can also wait for specific elements to appear:
# Wait for dynamic content to load
result = client.scrape(
url="https://spa-app.com/dashboard",
selectors={
"metrics": ".metric-value",
"chart_labels": ".chart-label",
},
wait_for_selector=".metric-value", # Wait until this element exists
delay=2000, # Additional 2-second wait after element appears
)
Pagination: Scraping Multiple Pages
all_products = []
for page_num in range(1, 11): # Scrape 10 pages
url = f"https://example-shop.com/products?page={page_num}"
result = client.scrape(
url=url,
selectors={
"names": ".product-name",
"prices": ".product-price",
"ratings": ".product-rating@data-score",
}
)
page_products = result["data"]["names"]
if not page_products:
break # No more results
for i in range(len(page_products)):
all_products.append({
"name": result["data"]["names"][i],
"price": result["data"]["prices"][i],
"rating": result["data"]["ratings"][i] if i < len(result["data"]["ratings"]) else None,
})
print(f"Page {page_num}: found {len(page_products)} products")
print(f"\nTotal: {len(all_products)} products scraped")
Content Extraction with /v1/extract
The /v1/extract endpoint is designed for a different use case: converting an entire webpage into clean, readable text or markdown. This is the endpoint you want when building AI/LLM data pipelines.
Why Content Extraction Matters for AI
Large language models work with text, not HTML. If you feed raw HTML into a language model, most of the token budget is wasted on tags, attributes, scripts, and styling. Content extraction strips all of that away, giving you just the meaningful content in a format the model can work with efficiently.
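To see why this matters, compare a page's character budget before and after stripping markup. This is a rough illustration using a naive regex, not how the API itself works, and real pages are far worse once scripts, styles, and attributes are counted:

```python
import re

def markup_overhead(html: str) -> float:
    """Fraction of characters spent on tags rather than visible text."""
    text = re.sub(r"<[^>]+>", "", html)
    return 1 - len(text) / len(html)

page = (
    "<html><body><div class='post'>"
    "<p>Just nine words of actual content on this page.</p>"
    "</div></body></html>"
)
print(f"{markup_overhead(page):.0%} of the page is markup")
```

Even in this tiny example, over half the characters are markup; on a production page with framework boilerplate, the visible text is often under 10% of the HTML.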
Basic Content Extraction
curl "https://api.snapapi.pics/v1/extract?url=https://example.com/blog-post&format=markdown" -H "Authorization: Bearer YOUR_API_KEY"
Response:
{
  "success": true,
  "content": "# Blog Post Title\n\nThis is the main content of the blog post, extracted as clean markdown...\n\n## Section Heading\n\nMore content here with **bold** and *italic* formatting preserved.\n\n- List item 1\n- List item 2\n- List item 3\n",
  "metadata": {
    "title": "Blog Post Title",
    "description": "Meta description of the page",
    "author": "Author Name",
    "publishedDate": "2026-03-01",
    "wordCount": 1847,
    "url": "https://example.com/blog-post"
  }
}
Python Example: Feeding Web Content to an LLM
from snapapi import SnapAPI
from openai import OpenAI
snap = SnapAPI("YOUR_SNAPAPI_KEY")
openai = OpenAI(api_key="YOUR_OPENAI_KEY")
# Step 1: Extract content from a webpage
extracted = snap.extract(
url="https://techcrunch.com/2026/03/01/some-article",
format="markdown"
)
# Step 2: Send to LLM for analysis
response = openai.chat.completions.create(
model="gpt-4o",
messages=[
{
"role": "system",
"content": "You are a tech news analyst. Summarize the following article and identify key takeaways."
},
{
"role": "user",
"content": f"Analyze this article:\n\n{extracted['content']}"
}
]
)
print(response.choices[0].message.content)
Building a RAG Pipeline
Retrieval-Augmented Generation (RAG) systems need clean text to build their knowledge base. Here is how to use SnapAPI extract with a vector database:
from snapapi import SnapAPI
import chromadb
from sentence_transformers import SentenceTransformer
snap = SnapAPI("YOUR_API_KEY")
model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.Client()
collection = chroma.create_collection("web_knowledge")
urls = [
"https://docs.python.org/3/tutorial/index.html",
"https://fastapi.tiangolo.com/tutorial/",
"https://docs.docker.com/get-started/",
]
for url in urls:
# Extract clean content
result = snap.extract(url=url, format="markdown")
content = result["content"]
# Chunk the content (simple approach: split by paragraphs)
chunks = [c.strip() for c in content.split("\n\n") if len(c.strip()) > 50]
# Generate embeddings and store
for i, chunk in enumerate(chunks):
embedding = model.encode(chunk).tolist()
collection.add(
documents=[chunk],
embeddings=[embedding],
ids=[f"{url}_{i}"],
metadatas=[{"url": url, "chunk_index": i}]
)
print(f"Indexed {len(chunks)} chunks from {url}")
# Now query the knowledge base
query = "How do I create a Docker container?"
query_embedding = model.encode(query).tolist()
results = collection.query(query_embeddings=[query_embedding], n_results=3)
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
print(f"\nFrom {meta['url']}:")
print(doc[:200] + "...")
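Splitting on blank lines is the simplest chunking strategy. For retrieval quality, a fixed-size window with overlap often works better because it preserves context that straddles paragraph boundaries. A minimal sketch:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks

# 1000 characters -> windows starting at 0, 400, 800
chunks = chunk_text("x" * 1000)
print(len(chunks))  # 3
```

In practice you would tune `size` and `overlap` to your embedding model's context window, and split on sentence boundaries rather than raw character offsets.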
Scrape vs. Extract: When to Use Which
| Use Case | Best Endpoint | Why |
|---|---|---|
| Product data from e-commerce | /v1/scrape | You know the exact selectors for price, name, image |
| Blog content for LLM analysis | /v1/extract | You want clean markdown, not specific elements |
| Competitor pricing tables | /v1/scrape | Structured data from known page layouts |
| News article summarization | /v1/extract | Main content extraction without HTML noise |
| Job listings from career pages | /v1/scrape | Repeating elements with consistent selectors |
| Documentation for RAG | /v1/extract | Full page content in embedding-ready format |
| Social media profiles | /v1/scrape | Specific data points (followers, bio, posts) |
| Research paper analysis | /v1/extract | Full text content for LLM processing |
Handling Common Challenges
Anti-Bot Protection
Many websites use bot detection (Cloudflare, DataDome, PerimeterX). SnapAPI uses stealth browser profiles that mimic real user behavior, making your scraping requests harder to detect and block. Features include:
- Realistic browser fingerprints (navigator properties, WebGL, Canvas)
- Human-like timing and interaction patterns
- Automatic proxy rotation for high-volume scraping
- Cookie and session management
Rate Limiting and Politeness
Responsible scraping means respecting the target website's resources. Here is a pattern for polite, rate-limited scraping:
import time
import random
def scrape_politely(client, urls, selectors, min_delay=1.0, max_delay=3.0):
"""Scrape multiple URLs with random delays to be polite."""
results = []
for url in urls:
try:
data = client.scrape(url=url, selectors=selectors)
results.append({"url": url, "data": data["data"], "status": "ok"})
except Exception as e:
results.append({"url": url, "error": str(e), "status": "error"})
# Random delay between requests
delay = random.uniform(min_delay, max_delay)
time.sleep(delay)
return results
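Politeness covers the happy path; transient failures (timeouts, temporary blocks) also deserve retries with exponential backoff rather than immediate re-hits. A sketch that wraps any scrape call:

```python
import random
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # 1s, 2s, 4s, ... plus random jitter to avoid synchronized retries
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Usage would look like `with_retries(lambda: client.scrape(url=url, selectors=selectors))`. In production you would catch a narrower exception type than bare `Exception`.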
Handling JavaScript-Heavy SPAs
Single-page applications often load data via API calls after the initial page render. SnapAPI handles this automatically since it uses a real browser, but you may need to add wait conditions:
# For React/Vue/Angular apps
result = client.scrape(
url="https://react-app.com/products",
selectors={"items": ".product-item"},
wait_for_selector=".product-item",
delay=2000, # Extra wait for lazy-loaded content
block_ads=True
)
Building a Complete Data Pipeline
Here is a production-ready pattern that combines scraping with data storage and monitoring:
import json
import time
from datetime import datetime
from pathlib import Path
from snapapi import SnapAPI
client = SnapAPI("YOUR_API_KEY")
def run_competitor_monitor():
"""Daily competitor pricing monitor."""
competitors = [
{
"name": "competitor_a",
"url": "https://competitor-a.com/pricing",
"selectors": {
"plan_names": ".plan-name",
"plan_prices": ".plan-price",
"plan_features": ".plan-features li",
}
},
{
"name": "competitor_b",
"url": "https://competitor-b.com/pricing",
"selectors": {
"plan_names": ".tier-title",
"plan_prices": ".tier-price",
"plan_features": ".tier-feature",
}
},
]
timestamp = datetime.now().isoformat()
results = {"timestamp": timestamp, "data": {}}
for comp in competitors:
try:
scraped = client.scrape(
url=comp["url"],
selectors=comp["selectors"],
block_ads=True,
block_cookie_banners=True
)
results["data"][comp["name"]] = scraped["data"]
print(f" Scraped {comp['name']}: OK")
except Exception as e:
results["data"][comp["name"]] = {"error": str(e)}
print(f" Scraped {comp['name']}: FAILED - {e}")
time.sleep(2)
# Save results
output_dir = Path("pricing_data")
output_dir.mkdir(exist_ok=True)
date_str = datetime.now().strftime("%Y-%m-%d")
with open(output_dir / f"{date_str}.json", "w") as f:
json.dump(results, f, indent=2)
return results
# Run the monitor
data = run_competitor_monitor()
print(json.dumps(data, indent=2))
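Once daily snapshots accumulate, the interesting signal is what changed between runs. Here is a small diff helper over two snapshots in the shape the monitor saves (plan names and prices as parallel lists per competitor); `price_changes` is an illustrative helper, not part of any SDK:

```python
def price_changes(old: dict, new: dict) -> dict:
    """Report plans whose price differs between two scraped snapshots."""
    changes = {}
    for comp, data in new.items():
        prev = old.get(comp, {})
        before = dict(zip(prev.get("plan_names", []), prev.get("plan_prices", [])))
        for plan, price in zip(data.get("plan_names", []), data.get("plan_prices", [])):
            if plan in before and before[plan] != price:
                changes.setdefault(comp, {})[plan] = (before[plan], price)
    return changes

yesterday = {"competitor_a": {"plan_names": ["Pro"], "plan_prices": ["$29"]}}
today = {"competitor_a": {"plan_names": ["Pro"], "plan_prices": ["$39"]}}
print(price_changes(yesterday, today))  # {'competitor_a': {'Pro': ('$29', '$39')}}
```

Feed the result into a Slack webhook or email alert and the monitor becomes actionable instead of just an archive.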
Conclusion
Web data extraction has evolved beyond simple HTML parsing. Modern APIs like SnapAPI handle the hard parts -- JavaScript rendering, bot detection, dynamic content, and browser management -- so you can focus on what to do with the data rather than how to get it.
Key takeaways:
- Use /v1/scrape when you need specific data points from known page structures
- Use /v1/extract when you want clean content for AI/LLM processing
- Always implement rate limiting and respect target websites
- Cache results to minimize redundant API calls
- Build monitoring pipelines for ongoing data collection
Start extracting web data today
200 free requests per month. Scrape, extract, and screenshot with a single API.
Get Your Free API Key