How to Extract Structured JSON from Any Website (Without Parsing HTML)
📅 March 19, 2026 • ⏱ 9 min read • By Papalily Team
Every developer who's scraped data has lived this nightmare: spend 45 minutes inspecting the DOM,
writing CSS selectors, testing edge cases — then the website redesigns two weeks later and your
parser breaks. Extracting JSON from a website without maintaining brittle HTML parsers
is one of the most common frustrations in data engineering.
In 2026, AI extraction has solved this problem.
The Old Way: HTML Parsing
Traditional data extraction involves these steps — and every one is a potential point of failure:
Python with BeautifulSoup
old-way.py (fragile — breaks on site updates)
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch HTML (fails on React/Vue sites)
html = requests.get('https://shop.example.com/products').text
# Step 2: Parse HTML structure
soup = BeautifulSoup(html, 'html.parser')
# Step 3: Find elements with CSS selectors — THIS BREAKS
products = []
for card in soup.select('.ProductCard__container--xKp7Q'):  # 🔴 Hashed class
    products.append({
        'name': card.select_one('.ProductCard__title--aB3R').text.strip(),
        'price': card.select_one('.ProductCard__price--nV8K').text.strip(),
        # One redesign and this entire file is useless
    })
# Every selector above will break the next time they push a build
This approach has three fundamental problems: it fails on JavaScript-rendered pages, it uses
class names that change between builds, and it requires a unique parser per website. With 10 target
sites, you're maintaining 10 different parsers.
Why Parsers Break So Often
- CSS Modules: Generate random hashes like ProductCard__title--aB3R that change on every build
- Tailwind JIT: Utility classes get purged and renamed unpredictably
- A/B testing: Variant components have different DOM structures
- Design updates: Even minor UI refreshes restructure the DOM
- CDN caching: Different users see different versions during deploys
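The first failure mode, hashed class names, is easy to reproduce with nothing but the Python standard library. The snippet below is a self-contained illustration (the class names and markup are invented for the example): a selector that matched perfectly in one build silently finds nothing after a rebuild renames the hash.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect the text of elements whose class attribute matches exactly."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Start capturing text when the class attribute matches the selector
        if dict(attrs).get('class') == self.target_class:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.matches.append(data)
            self.capturing = False

def find_by_class(html, cls):
    finder = ClassFinder(cls)
    finder.feed(html)
    return finder.matches

build_v1 = '<div class="ProductCard__title--aB3R">Wireless Headphones Pro</div>'
build_v2 = '<div class="ProductCard__title--zQ41w">Wireless Headphones Pro</div>'

print(find_by_class(build_v1, 'ProductCard__title--aB3R'))  # ['Wireless Headphones Pro']
print(find_by_class(build_v2, 'ProductCard__title--aB3R'))  # [] — same data, parser finds nothing
```

Note that the second lookup doesn't raise an error; it just returns nothing, which is exactly why broken scrapers often go unnoticed until someone checks the output.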
The New Way: Describe What You Want, Get JSON
AI extraction flips the model entirely. Instead of specifying where data is in the HTML,
you describe what data you want in plain English. The AI reads the rendered page and
extracts it — regardless of how the HTML is structured.
cURL — Simplest possible example
cURL — extract JSON from any website
curl -X POST https://api.papalily.com/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com/products",
    "prompt": "Extract all products with name, price, original price, rating, review count, and product URL. Return as JSON array called products."
  }'
# Clean JSON back, regardless of site structure:
{
  "success": true,
  "data": {
    "products": [
      {
        "name": "Wireless Headphones Pro",
        "price": "$89.99",
        "original_price": "$129.99",
        "rating": 4.7,
        "review_count": 1283,
        "url": "https://shop.example.com/products/wireless-headphones-pro"
      }
    ]
  }
}
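Prices in the response come back as display strings ("$89.99"), which is what the page shows. If you need numbers for comparisons or analysis, a small normalizer after extraction does the job. A minimal sketch — the field names mirror the response above, but the helper itself is our post-processing, not part of the API:

```python
import re

def parse_price(display):
    """Turn a display price like '$89.99' or '$1,299.00' into a float."""
    match = re.search(r'\d[\d,]*\.?\d*', display)
    if not match:
        raise ValueError(f"No numeric price in {display!r}")
    return float(match.group().replace(',', ''))

# Using the product fields from the example response
product = {"price": "$89.99", "original_price": "$129.99"}
discount = 1 - parse_price(product["price"]) / parse_price(product["original_price"])
print(f"{discount:.0%}")  # 31%
```

You can also ask for numeric values directly in the prompt ("price as a number, no currency symbol"), but normalizing on your side keeps the extracted data faithful to what the page displayed.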
Node.js — Extract data from different content types
const API_KEY = process.env.PAPALILY_API_KEY;
async function extract(url, prompt) {
  const res = await fetch('https://api.papalily.com/scrape', {
    method: 'POST',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, prompt }),
  });
  return (await res.json()).data;
}

// Extract a data table
const tableData = await extract(
  'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue',
  'Extract the first table as a JSON array. Each row should have: rank, company, revenue_usd_billions, employees, country, industry.'
);

// Extract article content
const article = await extract(
  'https://techcrunch.com/2026/03/01/some-article',
  'Extract: title, author, published_date, estimated_read_time_minutes, main_content_text, and a list of key_points (bullet points from the article).'
);

// Extract a product card grid
const products = await extract(
  'https://www.bestbuy.com/site/laptops',
  'Get all laptops with name, brand, price, rating, review_count, and product_url. Return as array.'
);
console.log({ tableData, article, products });
Python — Extract and convert to DataFrame
import requests
import pandas as pd
import os
def extract_json(url, prompt):
    """Extract structured JSON from any URL using AI."""
    resp = requests.post(
        'https://api.papalily.com/scrape',
        headers={'x-api-key': os.environ['PAPALILY_API_KEY']},
        json={'url': url, 'prompt': prompt},
        timeout=60,
    )
    result = resp.json()
    if not result['success']:
        raise ValueError(f"Extraction failed: {result.get('error')}")
    return result['data']

# Extract a comparison table and load into pandas
data = extract_json(
    'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)',
    'Extract the main GDP table as a list of objects with: rank, country, gdp_usd_millions, year.'
)

# Works regardless of Wikipedia's table HTML structure
df = pd.DataFrame(data.get('rows', data.get('countries', [])))
print(df.head(10))
df.to_csv('gdp_data.csv', index=False)
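Because the model picks its field names from your prompt, it's worth validating the shape of each row before loading it into pandas. A hedged sketch of the kind of check you might add — the required keys mirror the GDP prompt above, and the dummy rows exist only to show the behavior:

```python
def validate_rows(rows, required_keys):
    """Keep rows that contain every required key; report what was dropped."""
    valid = [r for r in rows if required_keys <= r.keys()]
    dropped = len(rows) - len(valid)
    if dropped:
        print(f"Dropped {dropped} row(s) missing one of {sorted(required_keys)}")
    return valid

# Dummy rows standing in for an extraction result
rows = [
    {"rank": 1, "country": "Exampleland", "gdp_usd_millions": 100, "year": 2025},
    {"rank": 2, "country": "Sampletopia", "gdp_usd_millions": 90},  # missing 'year'
]
clean = validate_rows(rows, {"rank", "country", "gdp_usd_millions", "year"})
print(len(clean))  # 1
```

Failing loudly on malformed rows (or filtering them, as here) is cheaper than debugging a DataFrame full of NaN columns downstream.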
Examples for Different Data Types
Extracting Tables
For any data table on a webpage: "Extract the [table name] table as a JSON array. Each row should have columns: [col1], [col2], [col3]."
Works on Wikipedia tables, financial data tables, comparison matrices, spec sheets.
Extracting Lists
For numbered or bulleted lists: "Get all items in the [list name] as a JSON array of strings."
Or for structured lists: "Get all FAQ items as objects with question and answer fields."
Extracting Product Cards
For product grids or carousels: "Get all product cards with name, price, image URL, rating, and product URL."
The AI understands "product card" semantically and finds them regardless of CSS class names.
Extracting Articles
For news articles or blog posts: "Extract: headline, subheadline, author name, published date, article body text, and list of tags."
Works across news sites, blogs, and content platforms with completely different markup.
Extracting Contact Information
For business directories: "Get all businesses listed with name, phone number, address, website URL, and category."
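All five patterns above share a shape: a container noun plus a list of fields. If you extract many data types, a tiny prompt builder keeps the wording consistent across your codebase. This is a hypothetical helper of our own, not a Papalily feature:

```python
def build_prompt(container, fields=None):
    """Compose an extraction prompt from a container noun and field names."""
    if fields:
        # Structured items: ask for objects with explicit fields
        return (f"Get all {container} as a JSON array of objects "
                f"with fields: {', '.join(fields)}.")
    # Simple lists: just an array of strings
    return f"Get all items in the {container} as a JSON array of strings."

print(build_prompt("product cards",
                   ["name", "price", "image_url", "rating", "product_url"]))
print(build_prompt("table of contents"))
```

Keeping prompts in one place also makes them easy to version and test, the same way you would treat SQL queries.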
Why This Doesn't Break on Redesigns
Traditional parsers rely on the position of data in the DOM — its CSS class, its parent
element, its sibling index. When the DOM changes, the position changes and the parser breaks.
AI extraction relies on the meaning of data. The AI understands that "price" means
a currency amount near a product name, regardless of whether it's in a
<span class="price">, a <div data-price="89.99">,
or a <p class="gX7Kqp">. When the site redesigns, the meaning stays the same
— so the extraction keeps working.
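As a rough illustration of the difference — using a regex as a stand-in for the model, which is emphatically not how the service works internally — the same meaning, "a currency amount", is recoverable from all three markups even though no single CSS selector fits them all:

```python
import re

# Three different markups for the same fact
snippets = [
    '<span class="price">$89.99</span>',
    '<div data-price="89.99">$89.99</div>',
    '<p class="gX7Kqp">$89.99</p>',
]

# A positional selector would need three rules, one per structure;
# a meaning-based pattern needs one.
PRICE = re.compile(r'\$\d+(?:\.\d{2})?')
prices = [PRICE.search(s).group() for s in snippets]
print(prices)  # ['$89.99', '$89.99', '$89.99']
```

A real model generalizes far beyond what any regex can (currencies, localized formats, prices written in words), but the principle is the same: match on meaning, not on position.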
Stop Writing HTML Parsers
Get clean JSON from any website with a single API call. No selectors to maintain, no parsers to debug,
no broken scrapers after site redesigns. Try it free — 100 requests/month.
Get Free API Key on RapidAPI →
Full docs at papalily.com/docs