This Node.js web scraping tutorial covers the full spectrum — from basic HTTP fetching and HTML parsing to full headless browser automation to AI-powered extraction that requires no selectors at all. By the end, you'll understand when to use each approach and have working code for all three. The progression mirrors how the web itself has evolved, and how scraping tools have had to evolve with it.
If your target site is server-rendered HTML — content that's visible in the initial HTTP response without JavaScript execution — this is the fastest and simplest approach.
Requirements: `npm install cheerio` (Node 18+ ships a built-in `fetch`, so no other dependencies are needed).
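A minimal sketch of this level, assuming Node 18+ and an ES module (`.mjs`) so top-level `await` works; the `.athing` and `.titleline` selectors are Hacker News's actual front-page markup:

```js
// scrape-hn.mjs — run with `node scrape-hn.mjs` (Node 18+ for built-in fetch)
import * as cheerio from 'cheerio';

const res = await fetch('https://news.ycombinator.com/');
const $ = cheerio.load(await res.text());

// Each story is a <tr class="athing">; the link lives inside .titleline
const stories = $('.athing')
  .map((_, row) => {
    const link = $(row).find('.titleline > a').first();
    return { title: link.text(), url: link.attr('href') };
  })
  .get();

console.log(stories.slice(0, 5));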
This works perfectly for Hacker News because it's a plain HTML page with stable, semantic markup. The selectors are clean (`.athing`, `.titleline`) and unlikely to change.
When Level 1 fails: Try this on a React-based news site and you'll get empty results. The `fetch` call returns only the initial HTML shell, and the JavaScript that would render the content never executes in a plain HTTP client, so there's nothing to parse.
For React, Vue, Angular, or any site that renders content with JavaScript, you need a real browser. Playwright launches a full Chromium instance, waits for JavaScript to execute, then lets you query the fully-rendered DOM.
Requirements: `npm install playwright`, then `npx playwright install chromium` to download the browser binary.
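A sketch of the same extraction against a hypothetical JavaScript-rendered job board. The URL and the inner selectors (`h3`, `.company`) are placeholders you'd replace after inspecting the real page in DevTools; `[data-testid="job-card"]` stands in for whatever attribute the site's framework emits:

```js
// scrape-jobs.mjs — requires `npx playwright install chromium` first
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

// Placeholder URL — substitute any JavaScript-rendered listing page
await page.goto('https://example.com/jobs', { waitUntil: 'networkidle' });

// Block until the framework has mounted and the cards exist in the DOM
await page.waitForSelector('[data-testid="job-card"]');

const jobs = await page.$$eval('[data-testid="job-card"]', (cards) =>
  cards.map((card) => ({
    // Inner selectors are assumptions — inspect the real page to find yours
    title: card.querySelector('h3')?.textContent?.trim(),
    company: card.querySelector('.company')?.textContent?.trim(),
  }))
);

console.log(jobs);
await browser.close();
```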
Playwright's strength: It renders the page exactly like a real browser. JavaScript runs, React mounts, lazy content loads. You get the same DOM a human sees.
Playwright's weakness: You still need CSS selectors or XPath to extract the data.
The selector `[data-testid="job-card"]` in the sketch above? One sprint and it's gone. You're also responsible for installation, browser binary management, and memory usage; Chromium is not lightweight.
AI extraction is what happens when you combine a real browser (handling JavaScript rendering) with a language model (handling data extraction). You describe what you want in plain English. The AI figures out where it is on the page and returns structured JSON.
No selectors. No DOM queries. No maintenance when sites redesign.
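A sketch of what such a call might look like. The endpoint, header names, and body fields below are illustrative assumptions, not Papalily's documented API; check papalily.com/docs for the real request shape:

```js
// ai-extract.mjs — hypothetical request shape, not the documented Papalily API
const res = await fetch('https://papalily.p.rapidapi.com/extract', {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
    'x-rapidapi-key': process.env.RAPIDAPI_KEY, // key from your RapidAPI account
  },
  body: JSON.stringify({
    url: 'https://example.com/jobs',
    // A plain-English description instead of selectors
    prompt: 'Extract every job posting with its title, company, and location',
  }),
});

console.log(await res.json()); // structured JSON, shaped by the prompt
```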
| Factor | Level 1 (Cheerio) | Level 2 (Playwright) | Level 3 (AI API) |
|---|---|---|---|
| Site type | Static / server-rendered | JavaScript-rendered | Any site |
| Setup complexity | Low (npm install cheerio) | Medium (browsers, RAM) | Low (one API key) |
| Maintenance burden | Medium (selectors break) | High (selectors + browser) | Very low (AI adapts) |
| Speed per page | Fast (<1s) | Medium (3–8s) | Slower (8–15s) |
| Cost | Free (server resources) | Free + server RAM | API pricing |
| Anti-bot handling | None | Partial | Built in |
| Best for | High-volume, stable sites | JS sites, known structure | Teams, complex data, many sites |
For solo developers experimenting or scraping a handful of stable sites, Level 1 or Level 2 makes sense. But when you're building a data pipeline that a team depends on, the math changes.
Every hour spent debugging a broken selector is an hour not spent building product. Every deployment that breaks a scraper is an incident for on-call engineers. The maintenance cost of keeping N scrapers working across N sites is non-trivial at any meaningful scale.
Teams increasingly choose the API approach because it shifts maintenance responsibility to the provider and lets developers focus on what the data enables — not how to extract it.
In production, you might use all three, escalating only when a cheaper level comes up empty:

- Level 1 (Cheerio) for high-volume runs against stable, server-rendered sites.
- Level 2 (Playwright) for JavaScript-rendered sites whose structure you know.
- Level 3 (AI API) for complex data, many sites, or anywhere selector maintenance isn't worth the hours.
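Here's one way that composition could look, as a sketch that reuses the same placeholder URL, selectors, and hypothetical AI request shape as above:

```js
// pipeline.mjs — tiered scraper sketch; the selectors and the AI request
// shape are the same illustrative placeholders used earlier
import * as cheerio from 'cheerio';
import { chromium } from 'playwright';

// Level 1: static fetch + Cheerio. Returns [] if the selector matches nothing.
async function scrapeStatic(url, selector) {
  const $ = cheerio.load(await (await fetch(url)).text());
  return $(selector).map((_, el) => $(el).text().trim()).get();
}

// Level 2: full browser render, same selector.
async function scrapeRendered(url, selector) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    // Give client-side rendering a few seconds; swallow the timeout so an
    // absent selector means "escalate", not "crash"
    await page.waitForSelector(selector, { timeout: 5000 }).catch(() => {});
    return await page.$$eval(selector, (els) =>
      els.map((el) => el.textContent.trim())
    );
  } finally {
    await browser.close();
  }
}

// Level 3: AI extraction (hypothetical request shape, as before).
async function scrapeAi(url, prompt) {
  const res = await fetch('https://papalily.p.rapidapi.com/extract', {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-rapidapi-key': process.env.RAPIDAPI_KEY,
    },
    body: JSON.stringify({ url, prompt }),
  });
  return res.json();
}

// Try the cheapest level first; escalate only when a level comes back empty.
export async function scrape(url, selector, prompt) {
  const fast = await scrapeStatic(url, selector);
  if (fast.length) return fast;

  const rendered = await scrapeRendered(url, selector);
  if (rendered.length) return rendered;

  return scrapeAi(url, prompt);
}
```

Escalating this way keeps the cheap path hot: stable pages never touch Chromium or the API, and the expensive levels only pay for themselves on pages that actually need them.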
Papalily's AI extraction handles JavaScript rendering, anti-bot measures, and data structuring — all in one API call. No Cheerio, no Playwright, no selectors. Works on any public website. Free to start.
Get Free API Key on RapidAPI →

Full docs at papalily.com/docs