Web Scraping API Python Tutorial: Extract Data in 5 Minutes
This tutorial shows you how to extract structured data from websites using Python and a web scraping API. You'll have working code in under 5 minutes: no BeautifulSoup, no Selenium, no proxy management. Just requests and a URL.
Why Use a Scraping API Instead of BeautifulSoup?
BeautifulSoup is great for simple, static pages. But production scraping hits walls fast:
- JavaScript rendering: much of the web is rendered client-side, so BeautifulSoup sees empty divs.
- Bot detection: Cloudflare, DataDome, and PerimeterX block automated requests. Your scraper gets a CAPTCHA, not the page.
- Selector maintenance: write 50 CSS selectors today, update 12 of them next week because the site changed its HTML.
- Proxy rotation: scrape at scale and you need residential proxies, which cost $1-15/GB.
A scraping API reduces the infrastructure mess. You send a URL and prompt, then get structured data or a clear failure. You write business logic, not selector plumbing.
Setup: Get Your API Key
For this tutorial, we'll use Haunt API. It's free for 100 requests/month, enough to follow along and build something real.
- Go to the Haunt API signup form
- Create a free Haunt key, no credit card
- Copy your Haunt API key from the signup response
Install the only dependency you need:
pip install requests
Set your API key as an environment variable:
export HAUNT_API_KEY="your_key_here"
Warning: Never hardcode API keys in source code. Use environment variables or a .env file with python-dotenv.
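If the variable is missing, `os.environ["HAUNT_API_KEY"]` fails with a bare `KeyError`, which is not very helpful to whoever runs the script next. The sketch below shows one way to fail with a readable message instead. The helper name `load_api_key` is hypothetical (it is not part of Haunt or of requests), and it uses only the standard library.

```python
import os


def load_api_key(env=os.environ, name="HAUNT_API_KEY"):
    """Return the API key from the environment, or fail with a readable message.

    `load_api_key` is an illustrative helper, not part of any library.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f'{name} is not set. Run: export {name}="your_key_here"'
        )
    return key
```

Calling `load_api_key()` once at startup surfaces a missing or empty key before any request is sent. The `env` parameter exists only so the function can be exercised with a plain dict in tests.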
Basic Extraction: Get Data From Any Page
Here's the simplest possible scraper: one function that wraps a single POST request.
import os

import requests

API_KEY = os.environ["HAUNT_API_KEY"]
API_URL = "https://hauntapi.com/v1/extract"


def extract(url, prompt):
    """Extract data from a public URL using a natural language prompt."""
    response = requests.post(
        API_URL,
        headers={
            "X-API-Key": API_KEY,
            "Content-Type": "application/json",
        },
        json={"url": url, "prompt": prompt},
        timeout=30,  # don't hang forever on a stalled connection
    )
    response.raise_for_status()  # raise on 4xx/5xx responses
    return response.json()


# Extract product info from any e-commerce page
result = extract(
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "Get the book title, price, availability, and rating",
)
print(result)
The response comes back as structured JSON:
{
  "data": {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "availability": "In stock (22 available)",
    "rating": "Three"
  }
}
No CSS selectors. No XPath. No inspecting the DOM. You describe what you want in plain English and get structured data back.
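The values in that response are still strings, so most projects will want to convert them before doing arithmetic or comparisons. Here is a minimal sketch using the sample response shown above. The helper names (`parse_price`, `parse_stock`, `normalize`) are hypothetical, and the parsing assumes a dot as the decimal separator and the exact "(22 available)" wording from the sample; other sites will need different rules.

```python
import re
from decimal import Decimal

# Sample payload copied from the response shown in this tutorial.
sample = {
    "data": {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "availability": "In stock (22 available)",
        "rating": "Three",
    }
}

# Assumption: ratings arrive as English number words, as in the sample.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Pull the first number out of a price string like '£51.77'."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    # Drop thousands separators; assumes '.' is the decimal separator.
    return Decimal(match.group().replace(",", ""))


def parse_stock(text):
    """Return the unit count from 'In stock (22 available)', or None."""
    match = re.search(r"\((\d+) available\)", text)
    return int(match.group(1)) if match else None


def normalize(record):
    """Convert one extracted record into typed values."""
    return {
        "title": record.get("title"),
        "price": parse_price(record.get("price", "")),
        "stock": parse_stock(record.get("availability", "")),
        "rating": RATING_WORDS.get(record.get("rating")),
    }


book = normalize(sample["data"])
```

`Decimal` is used instead of `float` so that prices compare and sum exactly.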
Structured Extraction with Prompts
The prompt is your control surface. Be specific about what you want and how you want it:
# Extract a list of items from a page
result = extract(
    "https://news.ycombinator.com",
    "Get the top 10 stories with their titles, scores, and URLs. Return as a list.",
)
# Returns: {"data": [{"title": "...", "score": 342, "url": "..."}, ...]}

# Extract specific fields from a company page
result = extract(
    "https://example-startup.com/about",
    "Extract: company name, founding year, total funding amount, CEO name, and headquarters location",
)
# Returns: {"data": {"company_name": "...", "founded": 2020, ...}}

# Extract tabular data
result = extract(
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "Get the main programming paradigm, designer, first appeared year, and current stable version as key-value pairs",
)
The key insight: the same request shape works across many public pages. Change the URL, keep the prompt. Or change the prompt, keep the URL. Keep explicit failure handling for pages that block automated access.
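Because the field names come back from a natural language prompt, it may be worth checking each record for the keys you expect before passing it downstream. The sketch below is one way to do that. `missing_fields` is a hypothetical helper, and the key names in the usage example are assumptions about what a prompt might return, not a documented schema.

```python
def missing_fields(data, required):
    """Return the required keys that are absent or empty in one extracted record.

    Anything that is not a dict (for example an error string) is treated
    as missing every field.
    """
    if not isinstance(data, dict):
        return list(required)
    # None, empty strings, and empty containers count as missing; 0 does not.
    return [key for key in required if data.get(key) in (None, "", [], {})]


# Usage with made-up records
complete = {"title": "Example", "price": "£10.00"}
partial = {"title": "Example", "price": ""}
print(missing_fields(complete, ["title", "price"]))
print(missing_fields(partial, ["title", "price", "rating"]))
```

A non-empty result is a cue to tighten the prompt (for instance, by naming the exact keys you want) or to flag the page for review.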
Batch Scraping Multiple URLs
Real projects involve scraping hundreds of pages. Here's a batch scraper that runs a few requests concurrently:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

API_KEY = os.environ["HAUNT_API_KEY"]
API_URL = "https://hauntapi.com/v1/extract"


def extract_one(url, prompt):
    """Extract data from a single URL. Returns (url, data) or (url, error)."""
    try:
        response = requests.post(
            API_URL,
            headers={
                "X-API-Key": API_KEY,
                "Content-Type": "application/json",
            },
            json={"url": url, "prompt": prompt},
            timeout=30,
        )
        response.raise_for_status()
        return (url, response.json()["data"])
    except Exception as e:
        # Return the error as a string so one bad URL doesn't stop the batch
        return (url, str(e))


def batch_extract(urls, prompt, max_workers=3):
    """Scrape multiple URLs concurrently. Rate-limited to 3 concurrent."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_one, url, prompt): url for url in urls}
        for future in as_completed(futures):
            url, data = future.result()
            results[url] = data
            # A string result means extract_one caught an error
            status = "FAILED" if isinstance(data, str) else "OK"
            print(f"  {status} {url[:60]}...")
    return results


# Example: scrape multiple product pages
urls = [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
    "https://books.toscrape.com/catalogue/soumission_998/index.html",
]
products = batch_extract(urls, "Get the book title, price, and availability")
for url, data in products.items():
    if isinstance(data, dict):
        print(f"{data.get('title', 'N/A')}: {data.get('price', 'N/A')}")
    else:
        # Error string (or an unexpected shape) for this URL
        print(f"ERROR {url}: {data}")
Three concurrent workers is a conservative default. Rate limits vary by provider and plan, and free tiers are often capped at a handful of requests per second, so check your plan's limit before you increase max_workers.
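Limiting the number of workers caps concurrency, but it does not cap requests per second. If your plan states a per-second limit, a small thread-safe throttle can enforce it. The class below is a sketch, not part of requests or of any API client: `RateLimiter` is a hypothetical name, and the 5 requests/second in the usage line is an assumed figure, so substitute your plan's real limit.

```python
import threading
import time


class RateLimiter:
    """Space calls at least 1/max_per_second seconds apart across threads."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Reserve the next free time slot, then sleep until it arrives."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.min_interval
        # Sleep outside the lock so other threads can reserve their slots.
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


# Assumed limit of 5 requests/second; replace with your plan's figure.
limiter = RateLimiter(max_per_second=5)
```

Calling `limiter.wait()` as the first line of `extract_one` would keep the whole thread pool under the limit regardless of `max_workers`. The `clock` and `sleep` parameters are there only to make the class testable without real delays.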
Error Handling and Retries
Production scrapers need retry logic. Websites time out. APIs hiccup. Here's a robust wrapper:
import os
import time

import requests

API_KEY = os.environ["HAUNT_API_KEY"]
API_URL = "https://hauntapi.com/v1/extract"


def extract_with_retry(url, prompt, max_retries=3, backoff=2):
    """Extract with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            response = requests.post(
                API_URL,
                headers={
                    "X-API-Key": API_KEY,
                    "Content-Type": "application/json",
                },
                json={"url": url, "prompt": prompt},
                timeout=30,
            )
            if response.status_code == 429:
                # Rate limited: wait and retry
                wait = backoff ** attempt
                print(f"  Rate limited. Waiting {wait}s...")
                time.sleep(wait)
                continue
            response.raise_for_status()
            return response.json()["data"]
        except requests.exceptions.Timeout:
            print(f"  Timeout on attempt {attempt + 1}/{max_retries}")
            if attempt < max_retries - 1:
                time.sleep(backoff ** attempt)
            continue
        except requests.exceptions.HTTPError:
            if response.status_code >= 500:
                # Server error: retry (no point sleeping after the last attempt)
                print(f"  Server error {response.status_code}, retrying...")
                if attempt < max_retries - 1:
                    time.sleep(backoff ** attempt)
                continue
            # Client error (4xx): don't retry
            raise
    raise RuntimeError(f"Failed after {max_retries} attempts: {url}")
This handles the three most common failure modes: rate limits (429), timeouts, and server errors (5xx). Client errors like 400 (bad request) or 401 (bad key) fail immediately, because retrying won't help.
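To see what the `backoff ** attempt` schedule amounts to, it can help to compute the waits on their own. With the defaults above (`max_retries=3`, `backoff=2`) the waits are 1, 2, and 4 seconds. The sketch below reproduces that schedule and adds two optional extras that are not in the wrapper above: a cap on the longest wait and random jitter, which many retry schemes use so that parallel workers don't all retry at the same instant. `backoff_delays` and its `cap`, `jitter`, and `rng` parameters are hypothetical.

```python
import random


def backoff_delays(max_retries=3, backoff=2, cap=30.0, jitter=0.0, rng=random.random):
    """Return the wait in seconds before each retry: backoff ** attempt, capped.

    `jitter` adds up to that fraction of the base wait on top, so
    jitter=0.5 yields waits between 1.0x and 1.5x the base value.
    """
    delays = []
    for attempt in range(max_retries):
        base = min(cap, backoff ** attempt)
        delays.append(base + base * jitter * rng())
    return delays


print(backoff_delays())                  # same schedule as the wrapper above
print(backoff_delays(6, cap=10.0))       # longer schedule with a ceiling
```

With `jitter=0.0` the function is deterministic, which makes the schedule easy to reason about before any randomness is added.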
Saving Results to CSV and JSON
Extracted data is only useful if you save it. Here's a clean pattern for both formats:
import csv
import json


def save_json(data, filename="output.json"):
    """Save extracted data as JSON."""
    # Explicit UTF-8 so characters like "£" are written the same on every platform
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved {len(data)} records to {filename}")


def save_csv(data, filename="output.csv"):
    """Save list of dicts as CSV. Auto-detects columns."""
    if not data:
        return
    # If data is a dict of results per URL, flatten it
    if isinstance(data, dict):
        rows = [{"url": k, **v} if isinstance(v, dict) else {"url": k, "value": v}
                for k, v in data.items()]
    else:
        rows = data
    # Collect columns from every row in first-seen order. Rows can differ
    # (a failed URL only has "url" and "value"); missing cells are left blank.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    print(f"Saved {len(rows)} rows to {filename}")


# Usage
results = batch_extract(urls, "Get the book title, price, and availability")
save_json(results, "products.json")
save_csv(results, "products.csv")
Next Steps
You now have a working pattern for extracting structured data from public pages with Python. Here's where to take it:
- Schedule it: wrap your scraper in a cron job or Celery task for daily price monitoring or news aggregation.
- Add to a database: pipe results into PostgreSQL, MongoDB, or a Google Sheet instead of files.
- Build a pipeline: chain extractions by scraping a directory page for URLs, then scraping each URL for details.
- Monitor changes: diff today's results against yesterday's. Get alerts when prices change or content updates.
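As an illustration of the last item, change monitoring can start as a plain dictionary comparison between two saved snapshots. The sketch below assumes both snapshots have the `{url: record}` shape that `batch_extract` returns. `diff_results` is a hypothetical helper, and the URLs and prices in the usage example are made up.

```python
def diff_results(old, new):
    """Compare two {url: record} snapshots and report what changed."""
    changes = {"added": [], "removed": [], "changed": {}}
    for url, record in new.items():
        if url not in old:
            changes["added"].append(url)
        elif old[url] != record:
            changes["changed"][url] = {"before": old[url], "after": record}
    changes["removed"] = [url for url in old if url not in new]
    return changes


# Usage with made-up snapshots
yesterday = {
    "https://example.com/a": {"title": "A", "price": "£10.00"},
    "https://example.com/b": {"title": "B", "price": "£20.00"},
}
today = {
    "https://example.com/a": {"title": "A", "price": "£12.00"},
    "https://example.com/c": {"title": "C", "price": "£5.00"},
}
report = diff_results(yesterday, today)
print(report)
```

In practice the two snapshots would be loaded from dated JSON files (for example with `json.load`), and a non-empty report would trigger whatever alert you prefer.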
The approach works because the API handles fetching and extraction where it can and returns a clear error where it can't, leaving you to handle the logic that matters to your project.