
Node.js Web Scraping Tutorial 2026:
From Basics to AI Extraction

📅 March 26, 2026 ⏱ 12 min read By Papalily Team

This Node.js web scraping tutorial covers the full spectrum — from basic HTTP fetching and HTML parsing to full headless browser automation to AI-powered extraction that requires no selectors at all. By the end, you'll understand when to use each approach and have working code for all three. The progression mirrors how the web itself has evolved, and how scraping tools have had to evolve with it.

Level 1: fetch + Cheerio (Simple Static Sites)

⭐ Level 1 — Static HTML Sites

If your target site is server-rendered HTML — content that's visible in the initial HTTP response without JavaScript execution — this is the fastest and simplest approach.

Requirements: npm install cheerio (built-in fetch in Node 18+, no other dependencies)

level1-cheerio.js
```js
const cheerio = require('cheerio');

async function scrapeHackerNews() {
  // Hacker News is server-rendered — no JS needed
  const res = await fetch('https://news.ycombinator.com');
  const html = await res.text();
  const $ = cheerio.load(html);

  const stories = [];
  $('.athing').each((i, el) => {
    const titleEl = $(el).find('.titleline a').first();
    // The subtext cell lives in the *next* table row
    const subEl = $(el).next().find('.subtext');
    stories.push({
      rank: i + 1,
      title: titleEl.text().trim(),
      url: titleEl.attr('href'),
      points: parseInt(subEl.find('.score').text(), 10) || 0,
      // "123 comments" — split on any whitespace (HN uses a non-breaking space)
      comments: subEl.find('a').last().text().split(/\s/)[0],
    });
  });
  return stories.slice(0, 30);
}

// CommonJS has no top-level await, so run via .then()
scrapeHackerNews().then(stories => {
  console.log(`Scraped ${stories.length} stories`);
  console.log(stories[0]);
});
```

This works perfectly for Hacker News because it's a plain HTML page with stable, semantic markup. The selectors are clean (.athing, .titleline) and unlikely to change.

When Level 1 fails: Try this on a React-based news site and you'll get empty results. The fetch call returns the HTML shell before JavaScript runs, so there's no content to parse.
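Before reaching for a browser, it's worth a quick probe: fetch the raw HTML and check whether a string you expect on the rendered page is already there. This is a rough heuristic, not a guarantee (the marker could live in an inline JSON blob), but it's a cheap way to decide between Level 1 and Level 2:

```javascript
// Heuristic probe: does the raw HTML (no JS execution) already contain
// the content you want? If yes, Level 1 is likely enough.
function contentInRawHtml(html, marker) {
  return html.toLowerCase().includes(marker.toLowerCase());
}

// `marker` is any string you expect on the rendered page — a known
// headline, product name, etc.
async function probe(url, marker) {
  const res = await fetch(url);
  const html = await res.text();
  return contentInRawHtml(html, marker)
    ? 'Level 1 (cheerio) should work'
    : 'Content is likely JS-rendered: use Level 2 or 3';
}
```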

Level 2: Playwright (JavaScript-Heavy Sites)

⭐⭐ Level 2 — JavaScript-Rendered Sites

For React, Vue, Angular, or any site that renders content with JavaScript, you need a real browser. Playwright launches a full Chromium instance, waits for JavaScript to execute, then lets you query the fully-rendered DOM.

Requirements: npm install playwright then npx playwright install chromium

level2-playwright.js
```js
const { chromium } = require('playwright');

async function scrapeReactJobBoard(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Block images/fonts to speed up loading
  await page.route('**/*.{png,jpg,gif,woff,woff2}', r => r.abort());

  await page.goto(url, { waitUntil: 'networkidle' });

  // Wait for React to render the job listings
  await page.waitForSelector('[data-testid="job-card"]', { timeout: 10000 });

  const jobs = await page.evaluate(() => {
    const cards = document.querySelectorAll('[data-testid="job-card"]');
    return [...cards].map(card => ({
      title: card.querySelector('h2, h3')?.textContent?.trim(),
      company: card.querySelector('[class*="company"]')?.textContent?.trim(),
      location: card.querySelector('[class*="location"]')?.textContent?.trim(),
    }));
  });

  await browser.close();
  return jobs;
}

// CommonJS has no top-level await, so run via .then()
scrapeReactJobBoard('https://jobs.example.com').then(jobs => {
  console.log(`Found ${jobs.length} jobs`);
});
```

Playwright's strength: It renders the page exactly like a real browser. JavaScript runs, React mounts, lazy content loads. You get the same DOM a human sees.

Playwright's weakness: You still need CSS selectors or XPath to extract the data. The selector [data-testid="job-card"] above? One sprint and it's gone. You're also responsible for installation, browser binary management, and memory usage — Chromium is not lightweight.
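One way to soften the selector-breakage problem is to try an ordered list of candidate selectors and use the first that matches anything. The sketch below injects the query function, so the same helper works inside `page.evaluate`, with Cheerio, or against any DOM-like API; the selectors shown are assumptions, not from a real site:

```javascript
// Try candidate selectors in priority order; return the first that
// matches at least one element. `queryAll` maps a selector string to
// an array of matches for whatever DOM API you're using.
function firstMatching(selectors, queryAll) {
  for (const sel of selectors) {
    const found = queryAll(sel);
    if (found.length > 0) return { selector: sel, elements: found };
  }
  return { selector: null, elements: [] };
}

// Inside page.evaluate you might call it like:
// firstMatching(
//   ['[data-testid="job-card"]', '.job-card', 'article[class*="job"]'],
//   sel => [...document.querySelectorAll(sel)]
// );
```

This doesn't eliminate maintenance, but a redesign has to break every candidate before your scraper returns nothing.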

Level 3: AI Extraction (Complex Data, Any Site)

⭐⭐⭐ Level 3 — AI-Powered Extraction

AI extraction is what happens when you combine a real browser (handling JavaScript rendering) with a language model (handling data extraction). You describe what you want in plain English. The AI figures out where it is on the page and returns structured JSON.

No selectors. No DOM queries. No maintenance when sites redesign.

level3-ai-extraction.js
```js
// Zero dependencies beyond built-in fetch (Node 18+)
async function scrapeWithAI(url, prompt) {
  const res = await fetch('https://api.papalily.com/scrape', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAPALILY_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, prompt }),
  });
  const result = await res.json();
  if (!result.success) throw new Error(result.error);
  return result.data;
}

// Works on ANY site — React, Vue, Angular, plain HTML
// Same code, any target, no maintenance
async function main() {
  const jobs = await scrapeWithAI(
    'https://jobs.example.com',
    'Get all job listings with title, company, location, salary, and apply URL'
  );
  const prices = await scrapeWithAI(
    'https://competitor.com/pricing',
    'Get all pricing tiers with name, monthly price, annual price, and feature list'
  );
  const articles = await scrapeWithAI(
    'https://news.example.com',
    'Get the 10 latest articles with title, author, date, summary, and URL'
  );
  console.log({ jobs, prices, articles });
}

main();
```
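The AI returns JSON in the shape your prompt describes, but it is still untyped data arriving over the network. A light runtime check (a sketch, using no schema library, with field names matching the job-listing prompt above) catches malformed responses before they enter your pipeline:

```javascript
// Validate that the response is an array of jobs with the required
// string fields; throw a descriptive error otherwise.
function validateJobs(data) {
  if (!Array.isArray(data)) {
    throw new Error('Expected an array of jobs');
  }
  return data.map((job, i) => {
    for (const field of ['title', 'company', 'location']) {
      if (typeof job[field] !== 'string' || job[field].length === 0) {
        throw new Error(`Job ${i} is missing "${field}"`);
      }
    }
    return job;
  });
}

// Usage: const jobs = validateJobs(await scrapeWithAI(url, prompt));
```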

When to Use Each Level

| Factor | Level 1 (Cheerio) | Level 2 (Playwright) | Level 3 (AI API) |
|---|---|---|---|
| Site type | Static / server-rendered | JavaScript-rendered | Any site |
| Setup complexity | Low (npm install cheerio) | Medium (browsers, RAM) | Low (one API key) |
| Maintenance burden | Medium (selectors break) | High (selectors + browser) | Very low (AI adapts) |
| Speed per page | Fast (<1s) | Medium (3–8s) | Slower (8–15s) |
| Cost | Free (server resources) | Free + server RAM | API pricing |
| Anti-bot handling | None | Partial | Built in |
| Best for | High-volume, stable sites | JS sites, known structure | Teams, complex data, many sites |

The Team Consideration

For solo developers experimenting or scraping a handful of stable sites, Level 1 or Level 2 makes sense. But when you're building a data pipeline that a team depends on, the math changes.

Every hour spent debugging a broken selector is an hour not spent building product. Every deployment that breaks a scraper is an incident for on-call engineers. The maintenance cost of keeping N scrapers working across N sites is non-trivial at any meaningful scale.

Teams increasingly choose the API approach because it shifts maintenance responsibility to the provider and lets developers focus on what the data enables — not how to extract it.

Combining All Three Levels

In production, you might use all three: probe with a cheap fetch first, fall back to a headless browser when the content turns out to be JavaScript-rendered, and reserve AI extraction for pages where selectors are unreliable or the data is complex.
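That tiered strategy can be sketched as a small orchestrator. The level names and the `run` functions below are placeholders standing in for the Level 1–3 functions shown earlier; the only real logic is "escalate when a level errors or returns nothing usable":

```javascript
// Treat null, undefined, and empty arrays/objects as "no usable data".
function looksEmpty(result) {
  if (result == null) return true;
  if (Array.isArray(result)) return result.length === 0;
  if (typeof result === 'object') return Object.keys(result).length === 0;
  return false;
}

// Try each level in order; return the first non-empty result along with
// which level produced it. Failures fall through to the next level.
async function scrapeTiered(url, levels) {
  for (const { name, run } of levels) {
    try {
      const data = await run(url);
      if (!looksEmpty(data)) return { level: name, data };
    } catch (err) {
      console.warn(`${name} failed: ${err.message}`);
    }
  }
  throw new Error('All levels failed');
}

// Usage (assuming the three functions from the earlier examples):
// const result = await scrapeTiered(url, [
//   { name: 'cheerio', run: scrapeStatic },
//   { name: 'playwright', run: scrapeBrowser },
//   { name: 'ai', run: u => scrapeWithAI(u, 'Get all job listings') },
// ]);
```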

Skip Levels 1 and 2 When You Can

Papalily's AI extraction handles JavaScript rendering, anti-bot measures, and data structuring — all in one API call. No Cheerio, no Playwright, no selectors. Works on any public website. Free to start.

Get Free API Key on RapidAPI →

Full docs at papalily.com/docs