How to Extract Structured JSON from Any Website (Without Parsing HTML)
📅 March 19, 2026 • ⏱ 9 min read • By Papalily Team
Every developer who's scraped data has lived this nightmare: spend 45 minutes inspecting the DOM,
writing CSS selectors, testing edge cases — then the website redesigns two weeks later and your
parser breaks. Extracting JSON from a website without maintaining brittle HTML parsers
is one of the most common frustrations in data engineering.
In 2026, AI extraction has solved this problem.
The Old Way: HTML Parsing
Traditional data extraction involves these steps — and every one is a potential point of failure:
Python with BeautifulSoup
old-way.py (fragile — breaks on site updates)
import requests
from bs4 import BeautifulSoup
# Step 1: Fetch HTML (fails on React/Vue sites)
html = requests.get('https://shop.example.com/products').text
# Step 2: Parse HTML structure
soup = BeautifulSoup(html, 'html.parser')
# Step 3: Find elements with CSS selectors — THIS BREAKS
products = []
for card in soup.select('.ProductCard__container--xKp7Q'):  # 🔴 Hashed class
    products.append({
        'name': card.select_one('.ProductCard__title--aB3R').text.strip(),
        'price': card.select_one('.ProductCard__price--nV8K').text.strip(),
        # One redesign and this entire file is useless
    })
# Every selector above will break the next time they push a build
This approach has three fundamental problems: it fails on JavaScript-rendered pages, it uses
class names that change between builds, and it requires a unique parser per website. With 10 target
sites, you're maintaining 10 different parsers.
Why Parsers Break So Often
- CSS Modules: Generate random hashes like ProductCard__title--aB3R that change on every build
- Tailwind JIT: Utility classes get purged and renamed unpredictably
- A/B testing: Variant components have different DOM structures
- Design updates: Even minor UI refreshes restructure the DOM
- CDN caching: Different users see different versions during deploys
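The first failure mode, hashed class names, is easy to reproduce with nothing but the Python standard library. The snippet below is a self-contained illustration (the class names and markup are invented for the example): a selector that matched perfectly in one build silently finds nothing after a rebuild renames the hash.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Collect the text of elements whose class attribute matches exactly."""
    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.capturing = False
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # Start capturing text when the class attribute matches the selector
        if dict(attrs).get('class') == self.target_class:
            self.capturing = True

    def handle_data(self, data):
        if self.capturing:
            self.matches.append(data)
            self.capturing = False

def find_by_class(html, cls):
    finder = ClassFinder(cls)
    finder.feed(html)
    return finder.matches

build_v1 = '<div class="ProductCard__title--aB3R">Wireless Headphones Pro</div>'
build_v2 = '<div class="ProductCard__title--zQ41w">Wireless Headphones Pro</div>'

print(find_by_class(build_v1, 'ProductCard__title--aB3R'))  # ['Wireless Headphones Pro']
print(find_by_class(build_v2, 'ProductCard__title--aB3R'))  # [] — same data, parser finds nothing
```

Note that the second lookup doesn't raise an error; it just returns nothing, which is exactly why broken scrapers often go unnoticed until someone checks the output.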
The New Way: Describe What You Want, Get JSON
AI extraction flips the model entirely. Instead of specifying where data is in the HTML,
you describe what data you want in plain English. The AI reads the rendered page and
extracts it — regardless of how the HTML is structured.
cURL — Simplest possible example
cURL — extract JSON from any website
curl -X POST https://api.papalily.com/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com/products",
    "prompt": "Extract all products with name, price, original price, rating, review count, and product URL. Return as JSON array called products."
  }'
# Clean JSON back, regardless of site structure:
{
  "success": true,
  "data": {
    "products": [
      {
        "name": "Wireless Headphones Pro",
        "price": "$89.99",
        "original_price": "$129.99",
        "rating": 4.7,
        "review_count": 1283,
        "url": "https://shop.example.com/products/wireless-headphones-pro"
      }
    ]
  }
}
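Prices in the response come back as display strings ("$89.99"), which is what the page shows. If you need numbers for comparisons or analysis, a small normalizer after extraction does the job. A minimal sketch — the field names mirror the response above, but the helper itself is our post-processing, not part of the API:

```python
import re

def parse_price(display):
    """Turn a display price like '$89.99' or '$1,299.00' into a float."""
    match = re.search(r'\d[\d,]*\.?\d*', display)
    if not match:
        raise ValueError(f"No numeric price in {display!r}")
    return float(match.group().replace(',', ''))

# Using the product fields from the example response
product = {"price": "$89.99", "original_price": "$129.99"}
discount = 1 - parse_price(product["price"]) / parse_price(product["original_price"])
print(f"{discount:.0%}")  # 31%
```

You can also ask for numeric values directly in the prompt ("price as a number, no currency symbol"), but normalizing on your side keeps the extracted data faithful to what the page displayed.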
Node.js — Extract data from different content types
const API_KEY = process.env.PAPALILY_API_KEY;
async function extract(url, prompt) {
  const res = await fetch('https://api.papalily.com/scrape', {
    method: 'POST',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, prompt }),
  });
  return (await res.json()).data;
}

// Extract a data table
const tableData = await extract(
  'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue',
  'Extract the first table as a JSON array. Each row should have: rank, company, revenue_usd_billions, employees, country, industry.'
);

// Extract article content
const article = await extract(
  'https://techcrunch.com/2026/03/01/some-article',
  'Extract: title, author, published_date, estimated_read_time_minutes, main_content_text, and a list of key_points (bullet points from the article).'
);

// Extract a product card grid
const products = await extract(
  'https://www.bestbuy.com/site/laptops',
  'Get all laptops with name, brand, price, rating, review_count, and product_url. Return as array.'
);
console.log({ tableData, article, products });
Python — Extract and convert to DataFrame
import requests
import pandas as pd
import os
def extract_json(url, prompt):
    """Extract structured JSON from any URL using AI."""
    resp = requests.post(
        'https://api.papalily.com/scrape',
        headers={'x-api-key': os.environ['PAPALILY_API_KEY']},
        json={'url': url, 'prompt': prompt},
        timeout=60,
    )
    result = resp.json()
    if not result['success']:
        raise ValueError(f"Extraction failed: {result.get('error')}")
    return result['data']

# Extract a comparison table and load into pandas
data = extract_json(
    'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)',
    'Extract the main GDP table as a list of objects with: rank, country, gdp_usd_millions, year.'
)

# Works regardless of Wikipedia's table HTML structure
df = pd.DataFrame(data.get('rows', data.get('countries', [])))
print(df.head(10))
df.to_csv('gdp_data.csv', index=False)
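Because the model picks its field names from your prompt, it's worth validating the shape of each row before loading it into pandas. A hedged sketch of the kind of check you might add — the required keys mirror the GDP prompt above, and the dummy rows exist only to show the behavior:

```python
def validate_rows(rows, required_keys):
    """Keep rows that contain every required key; report what was dropped."""
    valid = [r for r in rows if required_keys <= r.keys()]
    dropped = len(rows) - len(valid)
    if dropped:
        print(f"Dropped {dropped} row(s) missing one of {sorted(required_keys)}")
    return valid

# Dummy rows standing in for an extraction result
rows = [
    {"rank": 1, "country": "Exampleland", "gdp_usd_millions": 100, "year": 2025},
    {"rank": 2, "country": "Sampletopia", "gdp_usd_millions": 90},  # missing 'year'
]
clean = validate_rows(rows, {"rank", "country", "gdp_usd_millions", "year"})
print(len(clean))  # 1
```

Failing loudly on malformed rows (or filtering them, as here) is cheaper than debugging a DataFrame full of NaN columns downstream.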
Examples for Different Data Types
Extracting Tables
For any data table on a webpage: "Extract the [table name] table as a JSON array. Each row should have columns: [col1], [col2], [col3]."
Works on Wikipedia tables, financial data tables, comparison matrices, spec sheets.
Extracting Lists
For numbered or bulleted lists: "Get all items in the [list name] as a JSON array of strings."
Or for structured lists: "Get all FAQ items as objects with question and answer fields."
Extracting Product Cards
For product grids or carousels: "Get all product cards with name, price, image URL, rating, and product URL."
The AI understands "product card" semantically and finds them regardless of CSS class names.
Extracting Articles
For news articles or blog posts: "Extract: headline, subheadline, author name, published date, article body text, and list of tags."
Works across news sites, blogs, and content platforms with completely different markup.
Extracting Contact Information
For business directories: "Get all businesses listed with name, phone number, address, website URL, and category."
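All five patterns above share a shape: a container noun plus a list of fields. If you extract many data types, a tiny prompt builder keeps the wording consistent across your codebase. This is a hypothetical helper of our own, not a Papalily feature:

```python
def build_prompt(container, fields=None):
    """Compose an extraction prompt from a container noun and field names."""
    if fields:
        # Structured items: ask for objects with explicit fields
        return (f"Get all {container} as a JSON array of objects "
                f"with fields: {', '.join(fields)}.")
    # Simple lists: just an array of strings
    return f"Get all items in the {container} as a JSON array of strings."

print(build_prompt("product cards",
                   ["name", "price", "image_url", "rating", "product_url"]))
print(build_prompt("table of contents"))
```

Keeping prompts in one place also makes them easy to version and test, the same way you would treat SQL queries.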
Why This Doesn't Break on Redesigns
Traditional parsers rely on the position of data in the DOM — its CSS class, its parent
element, its sibling index. When the DOM changes, the position changes and the parser breaks.
AI extraction relies on the meaning of data. The AI understands that "price" means
a currency amount near a product name, regardless of whether it's in a
<span class="price">, a <div data-price="89.99">,
or a <p class="gX7Kqp">. When the site redesigns, the meaning stays the same
— so the extraction keeps working.
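As a rough illustration of the difference — using a regex as a stand-in for the model, which is emphatically not how the service works internally — the same meaning, "a currency amount", is recoverable from all three markups even though no single CSS selector fits them all:

```python
import re

# Three different markups for the same fact
snippets = [
    '<span class="price">$89.99</span>',
    '<div data-price="89.99">$89.99</div>',
    '<p class="gX7Kqp">$89.99</p>',
]

# A positional selector would need three rules, one per structure;
# a meaning-based pattern needs one.
PRICE = re.compile(r'\$\d+(?:\.\d{2})?')
prices = [PRICE.search(s).group() for s in snippets]
print(prices)  # ['$89.99', '$89.99', '$89.99']
```

A real model generalizes far beyond what any regex can (currencies, localized formats, prices written in words), but the principle is the same: match on meaning, not on position.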
Stop Writing HTML Parsers
Get clean JSON from any website with a single API call. No selectors to maintain, no parsers to debug,
no broken scrapers after site redesigns. Try it free — 100 requests/month.
Get Free API Key on RapidAPI →
Full docs at papalily.com/docs