
Node.js Web Scraping Tutorial 2026:
From Basics to AI Extraction

📅 March 26, 2026 ⏱ 12 min read By Papalily Team

This Node.js web scraping tutorial covers the full spectrum — from basic HTTP fetching and HTML parsing to full headless browser automation to AI-powered extraction that requires no selectors at all. By the end, you'll understand when to use each approach and have working code for all three. The progression mirrors how the web itself has evolved, and how scraping tools have had to evolve with it.

Level 1: fetch + Cheerio (Simple Static Sites)

⭐ Level 1 — Static HTML Sites

If your target site is server-rendered HTML — content that's visible in the initial HTTP response without JavaScript execution — this is the fastest and simplest approach.

Requirements: npm install cheerio (built-in fetch in Node 18+, no other dependencies)

level1-cheerio.js
```js
const cheerio = require('cheerio');

async function scrapeHackerNews() {
  // Hacker News is server-rendered — no JS needed
  const res = await fetch('https://news.ycombinator.com');
  const html = await res.text();
  const $ = cheerio.load(html);

  const stories = [];
  $('.athing').each((i, el) => {
    const titleEl = $(el).find('.titleline a').first();
    // The subtext cell lives in the *next* table row
    const subEl = $(el).next().find('.subtext');
    stories.push({
      rank: i + 1,
      title: titleEl.text().trim(),
      url: titleEl.attr('href'),
      points: parseInt(subEl.find('.score').text(), 10) || 0,
      // "123 comments" — split on any whitespace (HN uses a non-breaking space)
      comments: subEl.find('a').last().text().split(/\s/)[0],
    });
  });
  return stories.slice(0, 30);
}

// CommonJS has no top-level await, so run via .then()
scrapeHackerNews().then(stories => {
  console.log(`Scraped ${stories.length} stories`);
  console.log(stories[0]);
});
```

This works perfectly for Hacker News because it's a plain HTML page with stable, semantic markup. The selectors are clean (.athing, .titleline) and unlikely to change.

When Level 1 fails: Try this on a React-based news site and you'll get empty results. The fetch call returns the HTML shell before JavaScript runs, so there's no content to parse.
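Before reaching for a browser, it's worth a quick probe: fetch the raw HTML and check whether a string you expect on the rendered page is already there. This is a rough heuristic, not a guarantee (the marker could live in an inline JSON blob), but it's a cheap way to decide between Level 1 and Level 2:

```javascript
// Heuristic probe: does the raw HTML (no JS execution) already contain
// the content you want? If yes, Level 1 is likely enough.
function contentInRawHtml(html, marker) {
  return html.toLowerCase().includes(marker.toLowerCase());
}

// `marker` is any string you expect on the rendered page — a known
// headline, product name, etc.
async function probe(url, marker) {
  const res = await fetch(url);
  const html = await res.text();
  return contentInRawHtml(html, marker)
    ? 'Level 1 (cheerio) should work'
    : 'Content is likely JS-rendered: use Level 2 or 3';
}
```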

Level 2: Playwright (JavaScript-Heavy Sites)

⭐⭐ Level 2 — JavaScript-Rendered Sites

For React, Vue, Angular, or any site that renders content with JavaScript, you need a real browser. Playwright launches a full Chromium instance, waits for JavaScript to execute, then lets you query the fully-rendered DOM.

Requirements: npm install playwright then npx playwright install chromium

level2-playwright.js
```js
const { chromium } = require('playwright');

async function scrapeReactJobBoard(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Block images/fonts to speed up loading
  await page.route('**/*.{png,jpg,gif,woff,woff2}', r => r.abort());

  await page.goto(url, { waitUntil: 'networkidle' });

  // Wait for React to render the job listings
  await page.waitForSelector('[data-testid="job-card"]', { timeout: 10000 });

  const jobs = await page.evaluate(() => {
    const cards = document.querySelectorAll('[data-testid="job-card"]');
    return [...cards].map(card => ({
      title: card.querySelector('h2, h3')?.textContent?.trim(),
      company: card.querySelector('[class*="company"]')?.textContent?.trim(),
      location: card.querySelector('[class*="location"]')?.textContent?.trim(),
    }));
  });

  await browser.close();
  return jobs;
}

// CommonJS has no top-level await, so run via .then()
scrapeReactJobBoard('https://jobs.example.com').then(jobs => {
  console.log(`Found ${jobs.length} jobs`);
});
```

Playwright's strength: It renders the page exactly like a real browser. JavaScript runs, React mounts, lazy content loads. You get the same DOM a human sees.

Playwright's weakness: You still need CSS selectors or XPath to extract the data. The selector [data-testid="job-card"] above? One sprint and it's gone. You're also responsible for installation, browser binary management, and memory usage — Chromium is not lightweight.
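One way to soften the selector-breakage problem is to try an ordered list of candidate selectors and use the first that matches anything. The sketch below injects the query function, so the same helper works inside `page.evaluate`, with Cheerio, or against any DOM-like API; the selectors shown are assumptions, not from a real site:

```javascript
// Try candidate selectors in priority order; return the first that
// matches at least one element. `queryAll` maps a selector string to
// an array of matches for whatever DOM API you're using.
function firstMatching(selectors, queryAll) {
  for (const sel of selectors) {
    const found = queryAll(sel);
    if (found.length > 0) return { selector: sel, elements: found };
  }
  return { selector: null, elements: [] };
}

// Inside page.evaluate you might call it like:
// firstMatching(
//   ['[data-testid="job-card"]', '.job-card', 'article[class*="job"]'],
//   sel => [...document.querySelectorAll(sel)]
// );
```

This doesn't eliminate maintenance, but a redesign has to break every candidate before your scraper returns nothing.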

Level 3: AI Extraction (Complex Data, Any Site)

⭐⭐⭐ Level 3 — AI-Powered Extraction

AI extraction is what happens when you combine a real browser (handling JavaScript rendering) with a language model (handling data extraction). You describe what you want in plain English. The AI figures out where it is on the page and returns structured JSON.

No selectors. No DOM queries. No maintenance when sites redesign.

level3-ai-extraction.js
```js
// Zero dependencies beyond built-in fetch (Node 18+)
async function scrapeWithAI(url, prompt) {
  const res = await fetch('https://api.papalily.com/scrape', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.PAPALILY_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ url, prompt }),
  });
  const result = await res.json();
  if (!result.success) throw new Error(result.error);
  return result.data;
}

// Works on ANY site — React, Vue, Angular, plain HTML
// Same code, any target, no maintenance
async function main() {
  const jobs = await scrapeWithAI(
    'https://jobs.example.com',
    'Get all job listings with title, company, location, salary, and apply URL'
  );
  const prices = await scrapeWithAI(
    'https://competitor.com/pricing',
    'Get all pricing tiers with name, monthly price, annual price, and feature list'
  );
  const articles = await scrapeWithAI(
    'https://news.example.com',
    'Get the 10 latest articles with title, author, date, summary, and URL'
  );
  console.log({ jobs, prices, articles });
}

main();
```
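The AI returns JSON in the shape your prompt describes, but it is still untyped data arriving over the network. A light runtime check (a sketch, using no schema library, with field names matching the job-listing prompt above) catches malformed responses before they enter your pipeline:

```javascript
// Validate that the response is an array of jobs with the required
// string fields; throw a descriptive error otherwise.
function validateJobs(data) {
  if (!Array.isArray(data)) {
    throw new Error('Expected an array of jobs');
  }
  return data.map((job, i) => {
    for (const field of ['title', 'company', 'location']) {
      if (typeof job[field] !== 'string' || job[field].length === 0) {
        throw new Error(`Job ${i} is missing "${field}"`);
      }
    }
    return job;
  });
}

// Usage: const jobs = validateJobs(await scrapeWithAI(url, prompt));
```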

When to Use Each Level

| Factor | Level 1 (Cheerio) | Level 2 (Playwright) | Level 3 (AI API) |
|---|---|---|---|
| Site type | Static / server-rendered | JavaScript-rendered | Any site |
| Setup complexity | Low (npm install cheerio) | Medium (browsers, RAM) | Low (one API key) |
| Maintenance burden | Medium (selectors break) | High (selectors + browser) | Very low (AI adapts) |
| Speed per page | Fast (<1s) | Medium (3–8s) | Slower (8–15s) |
| Cost | Free (server resources) | Free + server RAM | API pricing |
| Anti-bot handling | None | Partial | Built in |
| Best for | High-volume, stable sites | JS sites, known structure | Teams, complex data, many sites |

The Team Consideration

For solo developers experimenting or scraping a handful of stable sites, Level 1 or Level 2 makes sense. But when you're building a data pipeline that a team depends on, the math changes.

Every hour spent debugging a broken selector is an hour not spent building product. Every deployment that breaks a scraper is an incident for on-call engineers. The maintenance cost of keeping N scrapers working across N sites is non-trivial at any meaningful scale.

Teams increasingly choose the API approach because it shifts maintenance responsibility to the provider and lets developers focus on what the data enables — not how to extract it.

Combining All Three Levels

In production, you might use all three: probe with a cheap fetch first, fall back to a headless browser when the content turns out to be JavaScript-rendered, and reserve AI extraction for pages where selectors are unreliable or the data is complex.
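That tiered strategy can be sketched as a small orchestrator. The level names and the `run` functions below are placeholders standing in for the Level 1–3 functions shown earlier; the only real logic is "escalate when a level errors or returns nothing usable":

```javascript
// Treat null, undefined, and empty arrays/objects as "no usable data".
function looksEmpty(result) {
  if (result == null) return true;
  if (Array.isArray(result)) return result.length === 0;
  if (typeof result === 'object') return Object.keys(result).length === 0;
  return false;
}

// Try each level in order; return the first non-empty result along with
// which level produced it. Failures fall through to the next level.
async function scrapeTiered(url, levels) {
  for (const { name, run } of levels) {
    try {
      const data = await run(url);
      if (!looksEmpty(data)) return { level: name, data };
    } catch (err) {
      console.warn(`${name} failed: ${err.message}`);
    }
  }
  throw new Error('All levels failed');
}

// Usage (assuming the three functions from the earlier examples):
// const result = await scrapeTiered(url, [
//   { name: 'cheerio', run: scrapeStatic },
//   { name: 'playwright', run: scrapeBrowser },
//   { name: 'ai', run: u => scrapeWithAI(u, 'Get all job listings') },
// ]);
```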

Skip Levels 1 and 2 When You Can

Papalily's AI extraction handles JavaScript rendering, anti-bot measures, and data structuring — all in one API call. No Cheerio, no Playwright, no selectors. Works on any public website. Free to start.

Get Free API Key on RapidAPI →

Full docs at papalily.com/docs