
How to Extract Structured JSON from Any Website (Without Parsing HTML)

📅 March 19, 2026 ⏱ 9 min read By Papalily Team

Every developer who has scraped data has lived this nightmare: you spend 45 minutes inspecting the DOM, writing CSS selectors, and testing edge cases, then the website redesigns two weeks later and your parser breaks. Having to maintain brittle HTML parsers just to extract JSON from a website is one of the most common frustrations in data engineering. In 2026, AI extraction offers a way around it.

The Old Way: HTML Parsing

Traditional data extraction involves three steps: fetch the HTML, parse it, and pick out elements with CSS selectors. Every one is a potential point of failure:

Python with BeautifulSoup

old-way.py (fragile — breaks on site updates)
import requests
from bs4 import BeautifulSoup

# Step 1: Fetch HTML (fails on React/Vue sites)
html = requests.get('https://shop.example.com/products').text

# Step 2: Parse HTML structure
soup = BeautifulSoup(html, 'html.parser')

# Step 3: Find elements with CSS selectors (THIS BREAKS)
products = []
for card in soup.select('.ProductCard__container--xKp7Q'):  # 🔴 Hashed class
    products.append({
        'name': card.select_one('.ProductCard__title--aB3R').text.strip(),
        'price': card.select_one('.ProductCard__price--nV8K').text.strip(),
        # One redesign and this entire file is useless
    })

# Every selector above will break the next time they push a build

This approach has three fundamental problems: it fails on JavaScript-rendered pages, it uses class names that change between builds, and it requires a unique parser per website. With 10 target sites, you're maintaining 10 different parsers.
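To make the second problem concrete, here is a minimal sketch that uses only the Python standard library. The two HTML snippets and the `Zq9T` hash are made-up stand-ins for a page before and after a rebuild, and `select_text` is a hypothetical helper, not part of any library. It is a toy matcher (it does not handle void tags such as `<img>` inside a match), meant only to show that a selector tied to a hashed class silently returns nothing once the hash changes.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1             # nested tag inside a match
        elif self.target_class in classes:
            self.depth = 1              # entering a new match
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.matches[-1] += data


def select_text(html, target_class):
    """Hypothetical helper: text of all elements carrying target_class."""
    parser = ClassTextCollector(target_class)
    parser.feed(html)
    parser.close()
    return [m.strip() for m in parser.matches]


# Assumed markup: the same product title, only the class hash differs.
BEFORE = '<div><h2 class="ProductCard__title--aB3R">Wireless Headphones Pro</h2></div>'
AFTER = '<div><h2 class="ProductCard__title--Zq9T">Wireless Headphones Pro</h2></div>'

titles_before = select_text(BEFORE, "ProductCard__title--aB3R")
titles_after = select_text(AFTER, "ProductCard__title--aB3R")

print(titles_before)  # the selector matches the old build
print(titles_after)   # the same selector finds nothing after the rebuild
```

Note that the failure is silent: no exception is raised, the result is just empty, which is why this kind of breakage often goes unnoticed until the data is missing downstream.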

Why Parsers Break So Often

The New Way: Describe What You Want, Get JSON

AI extraction flips the model entirely. Instead of specifying where data is in the HTML, you describe what data you want in plain English. The AI reads the rendered page and extracts it — regardless of how the HTML is structured.

cURL — Simplest possible example

cURL — extract JSON from any website
curl -X POST https://api.papalily.com/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://shop.example.com/products",
    "prompt": "Extract all products with name, price, original price, rating, review count, and product URL. Return as JSON array called products."
  }'

# Clean JSON back, regardless of site structure:
{
  "success": true,
  "data": {
    "products": [
      {
        "name": "Wireless Headphones Pro",
        "price": "$89.99",
        "original_price": "$129.99",
        "rating": 4.7,
        "review_count": 1283,
        "url": "https://shop.example.com/products/wireless-headphones-pro"
      }
    ]
  }
}
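In the sample response, prices come back as display strings such as "$89.99", so you will probably want to convert them before doing arithmetic. The following is a minimal sketch that works on a copy of that sample response. The `parse_usd` and `normalize` helpers and the `savings` field are illustrative names of our own, not part of the API, and the parsing assumes US-style dollar formatting.

```python
import json
from decimal import Decimal

# Copy of the sample response shown above.
SAMPLE_RESPONSE = """
{
  "success": true,
  "data": {
    "products": [
      {
        "name": "Wireless Headphones Pro",
        "price": "$89.99",
        "original_price": "$129.99",
        "rating": 4.7,
        "review_count": 1283,
        "url": "https://shop.example.com/products/wireless-headphones-pro"
      }
    ]
  }
}
"""


def parse_usd(text):
    """Turn a string like "$1,299.00" into a Decimal (assumes US formatting)."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def normalize(product):
    """Return a copy of the product with numeric prices and a derived savings field."""
    price = parse_usd(product["price"])
    original = parse_usd(product["original_price"])
    return {
        **product,
        "price": price,
        "original_price": original,
        "savings": original - price,  # hypothetical derived field
    }


payload = json.loads(SAMPLE_RESPONSE)
if not payload["success"]:
    raise ValueError("Extraction failed")

products = [normalize(p) for p in payload["data"]["products"]]
print(products[0]["name"], products[0]["price"], products[0]["savings"])
```

`Decimal` is used instead of `float` so that currency values keep their exact two-digit precision.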

Node.js — Extract data from different content types

extract.js
const API_KEY = process.env.PAPALILY_API_KEY;

// POST a URL and a plain-English prompt, return the extracted data
async function extract(url, prompt) {
  const res = await fetch('https://api.papalily.com/scrape', {
    method: 'POST',
    headers: { 'x-api-key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, prompt }),
  });
  if (!res.ok) throw new Error(`Request failed with HTTP ${res.status}`);
  return (await res.json()).data;
}

// Top-level await below needs an ES module (.mjs or "type": "module")

// Extract a data table
const tableData = await extract(
  'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue',
  'Extract the first table as a JSON array. Each row should have: rank, company, revenue_usd_billions, employees, country, industry.'
);

// Extract article content
const article = await extract(
  'https://techcrunch.com/2026/03/01/some-article',
  'Extract: title, author, published_date, estimated_read_time_minutes, main_content_text, and a list of key_points (bullet points from the article).'
);

// Extract a product card grid
const products = await extract(
  'https://www.bestbuy.com/site/laptops',
  'Get all laptops with name, brand, price, rating, review_count, and product_url. Return as array.'
);

console.log({ tableData, article, products });

Python — Extract and convert to DataFrame

extract_to_dataframe.py
import os

import pandas as pd
import requests


def extract_json(url, prompt):
    """Extract structured JSON from any URL using AI."""
    resp = requests.post(
        'https://api.papalily.com/scrape',
        headers={'x-api-key': os.environ['PAPALILY_API_KEY']},
        json={'url': url, 'prompt': prompt},
        timeout=60
    )
    result = resp.json()
    if not result.get('success'):
        raise ValueError(f"Extraction failed: {result.get('error')}")
    return result['data']


# Extract a comparison table and load into pandas
data = extract_json(
    'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)',
    'Extract the main GDP table as a list of objects with: rank, country, gdp_usd_millions, year.'
)

# Works regardless of Wikipedia's table HTML structure.
# The top-level shape follows the prompt, so accept either a bare list
# or a dict that wraps the rows under 'rows' or 'countries'.
rows = data if isinstance(data, list) else data.get('rows', data.get('countries', []))
df = pd.DataFrame(rows)
print(df.head(10))
df.to_csv('gdp_data.csv', index=False)

Examples for Different Data Types

Extracting Tables

For any data table on a webpage: "Extract the [table name] table as a JSON array. Each row should have columns: [col1], [col2], [col3]." Works on Wikipedia tables, financial data tables, comparison matrices, spec sheets.
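If you extract tables from many pages, it can help to fill the template above programmatically. Here is a minimal sketch; `table_prompt` is a hypothetical helper name of our own, and the wording it produces simply mirrors the template in the text.

```python
def table_prompt(table_name, columns):
    """Fill the table template with a table name and a list of column names."""
    if not columns:
        raise ValueError("at least one column name is required")
    column_list = ", ".join(columns)
    return (
        f"Extract the {table_name} table as a JSON array. "
        f"Each row should have columns: {column_list}."
    )


prompt = table_prompt("GDP", ["rank", "country", "gdp_usd_millions", "year"])
print(prompt)
```

The resulting string can be passed as the `prompt` field of the request body shown in the earlier examples.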

Extracting Lists

For numbered or bulleted lists: "Get all items in the [list name] as a JSON array of strings." Or for structured lists: "Get all FAQ items as objects with question and answer fields."

Extracting Product Cards

For product grids or carousels: "Get all product cards with name, price, image URL, rating, and product URL." The AI understands "product card" semantically and finds them regardless of CSS class names.

Extracting Articles

For news articles or blog posts: "Extract: headline, subheadline, author name, published date, article body text, and list of tags." Works across news sites, blogs, and content platforms with completely different markup.

Extracting Contact Information

For business directories: "Get all businesses listed with name, phone number, address, website URL, and category."

Why This Doesn't Break on Redesigns

Traditional parsers rely on the position of data in the DOM — its CSS class, its parent element, its sibling index. When the DOM changes, the position changes and the parser breaks.

AI extraction relies on the meaning of data. The AI understands that "price" means a currency amount near a product name, regardless of whether it's in a <span class="price">, a <div data-price="89.99">, or a <p class="gX7Kqp">. When the site redesigns, the meaning stays the same — so the extraction keeps working.

Stop Writing HTML Parsers

Get clean JSON from any website with a single API call. No selectors to maintain, no parsers to debug, no broken scrapers after site redesigns. Try it free — 100 requests/month.

Get Free API Key on RapidAPI →

Full docs at papalily.com/docs