
Extract Text From Supported Webpages Using an API

Why extract text via API instead of parsing HTML?

Traditional web scraping for text extraction follows a tedious pattern: fetch the HTML, parse it with BeautifulSoup, strip out navigation and footers, remove scripts and styles, then hope what's left is the actual article content.

It works, until it doesn't. Modern websites use dynamic rendering, shadow DOM, and client-side frameworks that make simple HTML parsing unreliable. You end up with navigation text mixed into your content or, worse, empty results because the content loaded via JavaScript after your scraper had already finished.
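To make the manual pattern and its failure mode concrete, here is a minimal sketch using only Python's standard-library html.parser as a stand-in for BeautifulSoup. The set of tags treated as boilerplate and the two sample pages are illustrative assumptions, not anything the API itself uses.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an illustrative choice).
SKIP_TAGS = {"script", "style", "nav", "footer", "header", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def naive_extract(html):
    """Return the text left after dropping boilerplate tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# A server-rendered page: the article text is present in the HTML.
static_page = """
<html><body>
  <nav>Home | About</nav>
  <article><h1>Title</h1><p>Body text.</p></article>
  <footer>Copyright</footer>
  <script>track()</script>
</body></html>
"""

# A client-rendered shell: the content only appears after JavaScript runs.
js_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(naive_extract(static_page))     # Title Body text.
print(repr(naive_extract(js_shell)))  # ''
```

The second call returns an empty string because nothing in the raw HTML is content yet, which is exactly the case a rendering step is meant to cover.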

A text extraction API handles all of this for you:

  • JavaScript rendering: the API waits for the page to fully load
  • Content detection: ML-powered identification of the actual article/content area
  • Clean output: no nav bars, no footers, no ads, just the content you want
  • Blocked-page handling: clear failure metadata when a page requires login or human verification

Python: extract text from a URL in a few lines

Here's the simplest way to extract readable text from a supported public webpage using Python:

import requests

response = requests.post("https://hauntapi.com/v1/extract", headers={
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
}, json={
    "url": "https://en.wikipedia.org/wiki/Web_scraping"
})

data = response.json()
print(data["content"])  # Clean, extracted text

That's it. The API returns the page's main content as clean text: no HTML tags, no navigation clutter, no boilerplate.
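In practice you may want to guard against responses that carry no usable text, for example when a page is blocked. The sketch below assumes the "content" field shown above; the "error" field name and the ExtractionError class are hypothetical, so check the real response shape before relying on them.

```python
class ExtractionError(Exception):
    """Raised when a response carries no usable text."""


def content_from_payload(payload):
    """Return the extracted text from a decoded JSON payload, or raise."""
    if not isinstance(payload, dict):
        raise ExtractionError("unexpected response shape")
    content = payload.get("content")
    if isinstance(content, str) and content.strip():
        return content.strip()
    # "error" is a hypothetical field name for the failure metadata.
    raise ExtractionError(str(payload.get("error") or "no content returned"))


print(content_from_payload({"content": "  Web scraping is data scraping...  "}))

try:
    content_from_payload({"content": "", "error": "login required"})
except ExtractionError as exc:
    print("failed:", exc)
```

You would pass it response.json() from the request above, so that an empty or failed extraction surfaces as an exception instead of a silent empty string.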

With the Haunt API Python SDK

Install the SDK:

pip install hauntapi

Then call it from Python:

from hauntapi import HauntClient

client = HauntClient(api_key="your-api-key")
result = client.extract("https://en.wikipedia.org/wiki/Web_scraping")
print(result.content)

JavaScript / Node.js example

Same thing in Node.js using the Fetch API:

const response = await fetch("https://hauntapi.com/v1/extract", {
  method: "POST",
  headers: {
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    url: "https://en.wikipedia.org/wiki/Web_scraping"
  })
});

const data = await response.json();
console.log(data.content);

Get structured data, not just raw text

Sometimes you don't want all the text; you want specific data points. Haunt API lets you provide an extraction prompt to pull exactly what you need:

import requests

response = requests.post("https://hauntapi.com/v1/extract", headers={
    "X-API-Key": "your-api-key",
    "Content-Type": "application/json"
}, json={
    "url": "https://news.ycombinator.com",
    "prompt": "Extract the top 5 stories with their titles, points, and URLs as JSON"
})

print(response.json()["data"])
# Returns structured JSON with exactly what you asked for

This is where traditional scrapers fall short. With BeautifulSoup, you'd need to inspect the HTML, find the right CSS selectors, handle edge cases, and write brittle parsing code. With an extraction API, you just describe what you want in plain English.
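Because the output comes from a plain-English prompt, it is worth validating before you use it. Here is a minimal sketch that normalizes the result of the prompt above. It assumes the "data" field holds a list of objects with title, points, and url keys (possibly as a JSON string, or wrapped under a "stories" key); that shape is an assumption, and normalize_stories is a hypothetical helper, not part of the API.

```python
import json


def normalize_stories(data, limit=5):
    """Coerce prompt output into a clean list of story dicts."""
    if isinstance(data, str):
        data = json.loads(data)  # tolerate JSON returned as a string
    if isinstance(data, dict):
        data = data.get("stories", [])  # assumed wrapper key
    stories = []
    for item in data:
        if not isinstance(item, dict):
            continue
        title, url = item.get("title"), item.get("url")
        if not title or not url:
            continue  # skip incomplete entries
        try:
            points = int(item.get("points", 0))
        except (TypeError, ValueError):
            points = 0  # missing or non-numeric score
        stories.append({"title": str(title), "points": points, "url": str(url)})
    return stories[:limit]


sample = (
    '[{"title": "A", "points": "42", "url": "https://a.example"},'
    ' {"title": "", "url": "https://skip.example"},'
    ' {"title": "B", "points": null, "url": "https://b.example"}]'
)
print(normalize_stories(sample))
```

A check like this keeps a malformed entry from propagating into whatever consumes the data.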

Batch processing: extract text from multiple pages

Need to extract text from hundreds or thousands of pages? Here's a batch script that runs requests concurrently and records per-URL errors instead of stopping at the first failure:

import requests
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "your-api-key"
URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
    # ... hundreds more
]

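def extract_text(url):
    """Extract one URL; return a result dict instead of raising."""
    try:
        resp = requests.post(
            "https://hauntapi.com/v1/extract",
            headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
            json={"url": url},
            timeout=30,  # seconds; stops one slow page from stalling a worker
        )
        resp.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
        return {"url": url, "text": resp.json().get("content", ""), "status": "ok"}
    except Exception as e:
        # Record the failure so the rest of the batch keeps running
        return {"url": url, "text": "", "status": "error", "error": str(e)}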

results = []
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(extract_text, url): url for url in URLS}
    for future in as_completed(futures):
        results.append(future.result())

# Save to file
with open("extracted.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

ok_count = sum(1 for r in results if r["status"] == "ok")
print(f"Extracted {ok_count}/{len(URLS)} pages")
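The script above tries each URL once. For transient failures such as timeouts, a retry with exponential backoff is a common addition. Here is a minimal sketch: call_with_retries and flaky_fetch are hypothetical names, the delays are illustrative, and the stand-in fetch function replaces the real network call so the example runs offline.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


# Demo with a stand-in for the network call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"text from {url}"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_retries(flaky_fetch, "https://example.com/page-1", sleep=delays.append)
print(result, delays)
```

To combine this with extract_text, the retry would need to wrap the request inside the try block, since extract_text itself swallows exceptions. In a real pipeline you would likely retry only on errors that are plausibly transient, not on every exception.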

Comparison: API vs BeautifulSoup vs Readability

Here's how the approaches stack up for text extraction:

  • BeautifulSoup: Free, but you parse HTML manually. No JS rendering. Brittle selectors break when sites change. ~20 lines of code per site.
  • Mozilla Readability: Good for article content, but requires a headless browser. Doesn't handle non-article pages (product pages, listings, etc.).
  • Extraction API: Handles rendering fallback and content detection where supported. A few lines of code. Paid plans start at 5,000 successful requests/month.

Cost breakdown

Haunt API gives you 100 free requests per month on the Free plan. No credit card needed. For production use, Starter is £19/month for 5,000 successful public-page extractions. Compare that to the developer time you'd spend writing and maintaining BeautifulSoup scrapers.
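The arithmetic behind that comparison is simple enough to sketch. The plan figures below are the ones quoted above; the hourly developer rate is a placeholder assumption you should replace with your own, and the function names are hypothetical.

```python
# Figures quoted in the article.
FREE_REQUESTS_PER_MONTH = 100
STARTER_PRICE_GBP = 19.0
STARTER_REQUESTS_PER_MONTH = 5000


def cost_per_extraction():
    """Starter price divided by its monthly allowance, in GBP."""
    return STARTER_PRICE_GBP / STARTER_REQUESTS_PER_MONTH


def plan_for(pages_per_month):
    """Pick the smallest plan mentioned here that covers the volume."""
    if pages_per_month <= FREE_REQUESTS_PER_MONTH:
        return "Free"
    if pages_per_month <= STARTER_REQUESTS_PER_MONTH:
        return "Starter"
    return "above Starter"  # larger plans are not described here


def break_even_minutes(hourly_rate_gbp):
    """Minutes of developer time that cost the same as one Starter month."""
    return round(STARTER_PRICE_GBP / hourly_rate_gbp * 60, 1)


print(f"GBP {cost_per_extraction():.4f} per extraction")  # GBP 0.0038 per extraction
print(plan_for(80), plan_for(1200))                       # Free Starter
print(break_even_minutes(50))                             # 22.8 (assumed GBP 50/hour)
```

At an assumed rate of £50 per hour, a month of Starter costs about as much as 23 minutes of scraper maintenance.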


Common use cases

  • Content aggregation: Pull article text from multiple news sources into your platform
  • SEO analysis: Extract and analyze competitor page content at scale
  • Training data: Collect clean text datasets for LLM fine-tuning
  • Research automation: Extract papers, abstracts, and data from academic sources
  • Monitoring: Track changes in page content for compliance or competitive intelligence
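As one illustration of the monitoring case, change detection can be as simple as comparing a fingerprint of the extracted text between runs. This is a minimal sketch using the standard-library hashlib; the helper names and the whitespace normalization are assumptions, and the sample strings stand in for text returned by the API.

```python
import hashlib


def content_fingerprint(text):
    """Hash the text after collapsing whitespace, so layout noise is ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_change(previous_fingerprint, new_text):
    """Return (changed, current_fingerprint) for the newly extracted text."""
    current = content_fingerprint(new_text)
    return current != previous_fingerprint, current


baseline = content_fingerprint("Returns accepted within 30 days.")

# Same wording with different spacing: not a change.
same, _ = detect_change(baseline, "Returns   accepted within 30 days.\n")
# Different wording: flagged as a change.
edited, _ = detect_change(baseline, "Returns accepted within 14 days.")
print(same, edited)  # False True
```

Storing only the fingerprint per URL keeps the comparison cheap; keep the full text as well if you need to show what changed.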

The key insight: if you're still manually parsing HTML to get text from webpages, you're spending time on the wrong problem. Let the API handle rendering, parsing, and cleaning, and focus on actually using the data.
