You've been there. You write a scraper on Monday. By Friday the site changed its markup,
your selectors are broken, and the data pipeline is down. You spend two hours debugging
CSS selectors for a site that never published its internal HTML structure in the first place
and probably never will. This is the reality of scraping modern web applications.
It gets worse. The site is built with React. Your requests call returns skeleton HTML
containing a single <div id="root"></div> and nothing else.
The actual content is rendered client-side by JavaScript that runs in the browser —
not in your scraper.
This post covers why traditional scrapers fail on modern JS-heavy sites, how the old
workarounds (Puppeteer, Playwright) still leave you wrestling with selectors,
and how AI-powered scraping changes the game entirely.
Why Traditional Scrapers Fail on React and Vue Sites
Classic web scraping tools like requests (Python), axios (Node.js),
or curl operate at the HTTP level. They fetch the raw HTML response from the
server. That works fine for static sites and server-side rendered pages.
But React, Vue, Angular, Next.js (in CSR mode), and countless other modern frameworks
deliver a minimal HTML shell to the browser, then hydrate the UI using
JavaScript. The actual product listings, prices, job titles, or article headlines are
never in the initial HTML response — they're injected into the DOM after JS executes.
Three things make this especially painful:
- Hydration delay: Even if you use a headless browser, you need to wait
for React/Vue to finish rendering. The timing is unpredictable: it depends on
network speed, API calls the page makes, and component lifecycle hooks.
- Dynamic class names: Tools like CSS Modules, Tailwind JIT, or styled-components
often generate hashed or utility class names (sc-aX7bV, tw-flex-1)
that are meaningless for targeting and change between builds.
- Structural changes: When a design team ships a UI update, the DOM
structure changes. Your scraper breaks silently, or loudly at 3am when your
monitoring pipeline crashes.
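To see the first problem concretely, here's a minimal sketch. The HTML string below stands in for a typical client-side-rendered response (the script path and the JobListing marker are illustrative, not from any real site):

```javascript
// A typical response body from a client-side-rendered (CSR) React app:
// the server sends only a mount point, no actual content.
const csrResponse = `<!DOCTYPE html>
<html>
  <head><title>Jobs</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.abc123.js"></script>
  </body>
</html>`;

// Any HTTP-level scraper (requests, axios, curl) sees exactly this string.
// Naively searching it for job data finds nothing:
const hasListings = csrResponse.includes('JobListing');
console.log(hasListings); // false: the listings only exist after JS runs
```

No amount of HTML parsing helps here; the data simply isn't in the response.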
The Old Approach: Puppeteer / Playwright + Selectors
The standard solution has been to use a headless browser — Puppeteer (Chrome)
or Playwright (multi-browser) — to render the page, then use CSS selectors or
XPath to extract the data. This works, but it introduces a different set of problems.
Here's a typical Playwright scraper for a React-based job board:
scraper-old.js (Playwright + selectors)
const { chromium } = require('playwright');

// CommonJS has no top-level await, so wrap the scrape in an async IIFE
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://jobs.example.com/software-engineer');

  // Wait for the React app to hydrate
  await page.waitForSelector('.JobListing__container--xK7pQ');

  const jobs = await page.$$eval('.JobListing__container--xK7pQ', els =>
    els.map(el => ({
      title: el.querySelector('.JobListing__title--aB3Rd')?.textContent?.trim(),
      company: el.querySelector('.JobListing__company--qZ9Lx')?.textContent?.trim(),
      location: el.querySelector('.JobListing__location--mP2Yw')?.textContent?.trim(),
      salary: el.querySelector('.JobListing__salary--nV8Kj')?.textContent?.trim(),
    }))
  );

  await browser.close();
  console.log(jobs);
})();
This works — until the site ships a new build with different class names. Then you're
back to inspecting the DOM, finding the new selectors, and pushing a fix. For high-churn
sites, this maintenance burden is significant.
You also have to handle: waiting for the right element, scrolling to load lazy content,
closing cookie banners, and a dozen edge cases specific to each site. Every site is its
own mini-project.
The AI-Powered Approach: Describe What You Want, Get JSON Back
What if instead of specifying where the data is in the HTML, you just described
what the data is? That's the core idea behind AI-powered scraping.
Papalily works in three steps:
- Send a URL and a plain-English prompt describing what you want to extract.
- A real Chromium browser renders the page — executing all JavaScript, waiting for React/Vue to hydrate, and capturing the final DOM state exactly as a human would see it.
- Gemini AI reads the rendered page (text + screenshot) and extracts precisely the data you described, returning clean structured JSON.
No selectors. No XPath. No fragile DOM queries. If the site redesigns next week, your
prompt still works because the AI understands the semantic meaning of the content —
not its position in the DOM tree.
Code Examples
cURL — Quickest way to test
curl -X POST https://api.papalily.com/scrape \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://jobs.example.com/software-engineer",
"prompt": "Get all job listings with title, company, location, salary range, and job URL"
}'
# Response
{
"success": true,
"data": {
"jobs": [
{
"title": "Senior Backend Engineer",
"company": "Acme Corp",
"location": "Remote",
"salary": "$130,000 – $160,000",
"url": "https://jobs.example.com/listing/sr-backend-123"
}
]
},
"meta": { "duration_ms": 8921 }
}
Node.js — E-commerce price monitoring
// Monitor competitor prices on a React-based product page
async function getProductPrices(url) {
const res = await fetch('https://api.papalily.com/scrape', {
method: 'POST',
headers: {
'x-api-key': process.env.PAPALILY_API_KEY,
'Content-Type': 'application/json',
},
body: JSON.stringify({
url,
prompt: 'Get all products with name, current price, original price, and discount percentage. Include out-of-stock status.',
}),
});
const { data } = await res.json();
return data.products;
}
// Run it
const products = await getProductPrices('https://shop.competitor.com/laptops');
console.log(`Found ${products.length} products`);
// Compare with your own prices
products.forEach(p => {
if (parseFloat(p.current_price) < getOwnPrice(p.name)) {
console.log(`⚠ Competitor undercuts us on: ${p.name}`);
}
});
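The example above assumes the request always succeeds. In a scheduled pipeline you'll want to check both the HTTP status and the success flag before touching data. Here's a small sketch; the success/data envelope matches the cURL example earlier, but the error field is an assumption, not documented API behavior:

```javascript
// Validate a Papalily-style response envelope before using its data.
// Assumes the { success, data } shape shown in the cURL example;
// the `error` field is a guess at the failure shape, not confirmed API behavior.
function unwrapScrapeResponse(status, body) {
  if (status < 200 || status >= 300) {
    throw new Error(`Scrape request failed with HTTP ${status}`);
  }
  if (!body.success) {
    throw new Error(`Scrape unsuccessful: ${body.error ?? 'unknown error'}`);
  }
  return body.data;
}

// Example with a canned successful response:
const data = unwrapScrapeResponse(200, {
  success: true,
  data: { products: [{ name: 'Laptop A', current_price: '899.00' }] },
});
console.log(data.products.length); // 1
```

In the price-monitoring function above you'd call this as `unwrapScrapeResponse(res.status, await res.json())` so that a failed scrape surfaces as an exception instead of an `undefined` products array.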
Python — News aggregation
import requests
import os
def scrape_news(url, topic):
"""Scrape latest articles from any news site, React or not."""
resp = requests.post(
'https://api.papalily.com/scrape',
headers={'x-api-key': os.environ['PAPALILY_API_KEY']},
json={
'url': url,
'prompt': f'Get the 10 most recent articles about {topic}. '
          'Return title, author, published date, summary, and article URL for each.',
'wait_ms': 3000, # Extra wait for lazy-loaded content
}
)
return resp.json()['data']['articles']
# Works on React-based news sites, Vue-based blogs, static sites — anything
articles = scrape_news('https://techcrunch.com', 'artificial intelligence')
for article in articles:
print(f"{article['title']} — {article['author']} ({article['published_date']})")
Batch scraping — Multiple URLs in one call
// Scrape 3 competitor product pages in one API call
const res = await fetch('https://api.papalily.com/batch', {
method: 'POST',
headers: {
'x-api-key': process.env.PAPALILY_API_KEY,
'Content-Type': 'application/json',
},
body: JSON.stringify({
requests: [
{ url: 'https://shop-a.com/headphones', prompt: 'All headphones with name and price' },
{ url: 'https://shop-b.com/headphones', prompt: 'All headphones with name and price' },
{ url: 'https://shop-c.com/headphones', prompt: 'All headphones with name and price' },
],
}),
});
const { results } = await res.json();
// All 3 run in parallel — total time ≈ slowest single scrape
results.forEach((r, i) => console.log(`Shop ${i+1}:`, r.data));
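When batching, one URL can fail while the others succeed. The per-entry result shape isn't shown above, so this sketch assumes each entry carries its own success flag (with a hypothetical error field); adjust it to the actual response shape:

```javascript
// Split batch results into successes and failures so one bad URL
// doesn't sink the whole run. The per-entry { success, data, error }
// shape is an assumption; check the API docs for the real one.
function partitionBatchResults(results) {
  const ok = [];
  const failed = [];
  results.forEach((r, i) => {
    if (r.success) ok.push({ index: i, data: r.data });
    else failed.push({ index: i, error: r.error ?? 'unknown' });
  });
  return { ok, failed };
}

// Example with canned results:
const { ok, failed } = partitionBatchResults([
  { success: true, data: { headphones: [] } },
  { success: false, error: 'timeout' },
]);
console.log(ok.length, failed.length); // 1 1
```

This way a timeout on shop B still leaves you with fresh prices from shops A and C, plus a list of indices to retry.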
Real-World Use Cases
E-commerce Price Monitoring
Monitor competitor prices across React-based e-commerce stores. Traditional scrapers break
every time a store runs a frontend A/B test or updates their design. With AI extraction,
your prompt ("get all product prices and availability") keeps working regardless of
how the store's markup changes.
Run it on a schedule (cron, GitHub Actions, whatever) and push the results to a database.
Build price alerts, trend charts, or auto-repricing logic on top.
Job Listings Aggregation
Job boards are almost universally React or Vue now (Greenhouse, Lever, Ashby, Workday
— all JS-heavy). Aggregating listings from multiple boards used to require a
different scraper per platform. With a prompt like "get all open positions with title,
department, location, and apply URL," you get consistent JSON from every board without
writing a single platform-specific selector.
News and Content Aggregation
Build your own news reader, industry digest, or research tool by pulling articles from
multiple sources. The AI understands what an "article title" and "publication date" mean
semantically, so it works across different news site layouts without configuration.
Real Estate and Rental Listings
Real estate sites are notoriously complex — map-based UIs, lazy loading, infinite
scroll, all built in React. A prompt like "get all apartment listings with price, beds,
baths, square footage, and listing URL" works the same way on large portals like Zillow
or Redfin as on local agency sites, subject to each site's anti-bot measures.
The Maintenance Advantage
The real ROI of AI scraping isn't just initial development speed (though that's significant).
It's the ongoing maintenance savings.
With selector-based scrapers, every site update is a potential break. Teams at larger
companies often dedicate engineering time specifically to "scraper maintenance" —
a cost that grows linearly with the number of sites scraped.
AI-powered extraction is fundamentally more resilient to change because it understands
meaning, not structure. The same way a human can find the price on a
product page regardless of how it's styled, Gemini can extract the price whether it's
in a <span class="price">, a <div data-testid="product-price">,
or an element with a randomly generated class name.
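To make the contrast concrete, here's a toy illustration (not how Gemini works internally): an extractor keyed to one build's class name breaks when the markup changes, while one keyed to what a price looks like keeps working.

```javascript
// Two builds of the same product page: same price, different markup.
const buildA = '<span class="price">$49.99</span>';
const buildB = '<div data-testid="product-price">$49.99</div>';

// Structure-based extraction: keyed to one build's class name.
const byClass = html => (html.match(/class="price">([^<]+)</) || [])[1];

// Meaning-based extraction: keyed to what a price looks like.
const byMeaning = html => (html.match(/\$[\d,]+\.\d{2}/) || [])[0];

console.log(byClass(buildA));   // "$49.99"
console.log(byClass(buildB));   // undefined: broke on the redesign
console.log(byMeaning(buildA)); // "$49.99"
console.log(byMeaning(buildB)); // "$49.99": survives the redesign
```

A regex is obviously far cruder than an LLM reading the rendered page, but the failure mode it sidesteps is the same one that kills selector-based scrapers.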
Try Papalily Free
100 free requests per month. No credit card. Works on any site — React, Vue,
Angular, Next.js, or plain HTML. Get your API key in seconds.
Get Free API Key on RapidAPI →
Available on RapidAPI — secure billing, instant access
Limitations to Know
AI scraping isn't magic. A few things to keep in mind:
- Response time: Because a real browser renders the page (typically 8–15 seconds),
this isn't suitable for real-time APIs. It's designed for batch jobs,
scheduled pipelines, and background tasks.
- Login-walled content: Papalily doesn't handle authentication. It scrapes
publicly accessible pages. Logged-in content requires a different approach.
- Anti-bot measures: Aggressive bot detection (Cloudflare challenge pages,
CAPTCHAs) may block even real browser renders. A real browser helps a lot here, but
it's not a silver bullet for every site.
- Very large datasets: If you need thousands of paginated pages scraped
daily, you'll want dedicated scraping infrastructure. Papalily is optimized for
targeted, high-value extractions rather than bulk crawling.
Getting Started in 2 Minutes
- Sign up for a free API key at RapidAPI (no credit card needed).
- Copy the cURL example above, replace
YOUR_API_KEY, and point the URL and prompt at a site you actually need.
- Run it. You'll have clean JSON back in under 15 seconds.
The API is documented at papalily.com/docs
if you want to explore all the parameters (wait_ms, no_cache,
batch mode, etc.).
Writing a scraper for a React site in 2026 shouldn't require you to become an expert in
that site's internal DOM structure. Describe what you want. Get the data. That's it.