Web Scraping Resources

Tools, libraries, guides, and APIs for extracting data from the web. Curated for developers who actually build scrapers.

🌸 Want to skip the setup entirely? Try Papalily — POST a URL + plain-English prompt, get back structured JSON. Real Chromium browser, AI extraction, no selectors needed.

Try Free (50 req/mo) →

Browser Automation

Full browser control for JS-heavy sites

Playwright

Microsoft's end-to-end testing and automation library. Supports Chromium, Firefox, and WebKit. Well suited to scraping React/Vue/Angular apps that require JavaScript execution, because it auto-waits for elements to be ready before acting on them.

Node.js Python Java .NET Open Source

Puppeteer

Google's Node.js library for controlling Chrome/Chromium via the DevTools Protocol. The original headless Chrome scraping tool. Slightly lower-level than Playwright but extremely well-documented with a large community.

Node.js Open Source

Selenium

The original browser automation framework, widely used for testing and scraping. Supports all major browsers and multiple programming languages. More verbose than Playwright or Puppeteer, but it has one of the largest ecosystems and many years of community Q&A to draw on.

Python Java Node.js Ruby Open Source

Papalily ✨ API Option

Skip the browser setup entirely. POST a URL + plain-English prompt, get back structured JSON. Papalily runs Playwright Chromium for you and uses Gemini AI to extract exactly what you asked for. Free tier available.

REST API AI-Powered Free Tier

HTML Parsing

For static sites and server-rendered pages

Beautiful Soup

Python library for parsing HTML and XML. Beginner-friendly, with a simple, Pythonic API. Best for scraping static HTML; pair it with Playwright or Requests-HTML for JavaScript-rendered sites. One of the most widely used Python scraping libraries.

Python Open Source

Cheerio

Fast, flexible HTML parser for Node.js with a jQuery-like API. Ideal for parsing static HTML responses. It is not a browser and does not execute JavaScript, so use it with Playwright or Puppeteer when rendering is required. Lightweight and memory-efficient.

Node.js Open Source

lxml

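High-performance XML and HTML parsing library for Python, built on the C libraries libxml2 and libxslt. Typically much faster than Beautiful Soup's default pure-Python parser on large documents (Beautiful Soup can also use lxml as its parser backend). Supports XPath natively and CSS selectors through the cssselect package. The go-to choice when you need speed and are parsing large volumes of HTML.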

Python Open Source

jsoup

Java HTML parser with a CSS selector-based API. The standard choice for Java-based scraping. Handles malformed HTML gracefully. Good documentation and active maintenance.

Java Open Source
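
As a rough illustration of what "static HTML parsing" means for the libraries in this section, here is a minimal sketch that uses only Python's standard-library html.parser. That choice is an assumption made so the example runs without installing anything; it is not the API of Beautiful Soup, lxml, Cheerio, or jsoup, which all offer far more convenient selectors. The sample markup, the "product" class, and the names ProductLinkParser and extract_product_links are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical server-rendered HTML: everything we need is already in the markup,
# so no browser or JavaScript execution is involved.
SAMPLE_HTML = """
<ul>
  <li><a class="product" href="/p/1">Red Kettle</a></li>
  <li><a class="product" href="/p/2">Blue Teapot</a></li>
  <li><a href="/about">About us</a></li>
</ul>
"""


class ProductLinkParser(HTMLParser):
    """Collects (href, text) pairs for <a> elements that carry class="product"."""

    def __init__(self):
        super().__init__()
        self.links = []            # finished (href, text) pairs
        self._current_href = None  # set only while inside a matching <a>
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "product" in classes:
            self._current_href = attr_map.get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a matching link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_product_links(html):
    """Parse an HTML string and return the product links found in it."""
    parser = ProductLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_product_links(SAMPLE_HTML))
# [('/p/1', 'Red Kettle'), ('/p/2', 'Blue Teapot')]
```

If the same page built its list with client-side JavaScript, the markup fetched over HTTP would not contain these links, which is the case where a browser automation tool has to render the page first.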

Scraping Frameworks

Full-featured crawling platforms

Scrapy

Python's most full-featured web scraping framework. Its asynchronous architecture (built on Twisted) handles high volumes efficiently. Built-in support for item pipelines, middleware, item loaders, and feed exports (JSON, CSV, XML). Best for large-scale structured crawls.

Python Open Source

Crawlee

Modern Node.js web scraping library by Apify. Batteries-included: automatic retries, proxy rotation, session management, storage. Supports Playwright and Puppeteer as browser backends. Great developer experience.

Node.js Open Source

MechanicalSoup

Python library for automating interaction with websites. It stores cookies, follows links, and submits forms over plain HTTP, with no real browser and no JavaScript execution, which makes it a good fit for form submission and simple navigation on non-JS sites. Built on requests and Beautiful Soup.

Python Open Source

Requests-HTML

Python library that adds JavaScript rendering (via Pyppeteer) to the familiar requests API. Good middle ground for moderately JS-heavy sites. Easy to adopt if you already use requests.

Python Open Source

Proxy & Anti-Bot Services

For sites with blocking protection

ScraperAPI

Proxy-based scraping service that handles IP rotation, CAPTCHAs, and browser fingerprinting. Returns raw HTML — you still need to parse it with your own selectors. Good for high-volume raw HTML fetching.

Paid Service Proxy-based

Bright Data

Enterprise proxy network with residential, datacenter, and mobile IPs. One of the largest proxy providers. Also offers Scraping Browser (Playwright-compatible) and pre-built datasets. Best for large enterprise scraping.

Paid Service Enterprise

Oxylabs

Premium proxy and web data platform. Residential proxies, datacenter proxies, and a Scraper API. Strong reputation for reliability. Higher price point, aimed at enterprise use cases.

Paid Service Enterprise

ZenRows

API-based scraping service with anti-bot bypass. Returns HTML or structured data. Developer-friendly pricing starting from $49/month. Good documentation and JS rendering support.

Paid Service

Tutorials & Guides

Learning resources for all levels

Playwright Official Docs

Comprehensive documentation covering installation, core concepts, API reference, and examples. The best starting point for learning Playwright for scraping or testing.

Free Official

Scrapy Tutorial

Official Scrapy tutorial walking you through building a full scraper from scratch. Covers spiders, selectors, pipelines, and deploying. The most complete introduction to structured crawling in Python.

Free Official

How AI Scraping Works — Papalily Blog

Deep dive into why CSS selectors break on React/Vue sites and how AI-based extraction solves the selector maintenance problem. Includes code examples comparing Playwright DIY vs AI API.

Free AI Scraping

How to Scrape Amazon Without Getting Blocked

Practical guide covering common Amazon anti-bot mechanisms, proxy strategies, and when to use API-based scraping instead of DIY. Covers rate limiting, user agent rotation, and JavaScript rendering requirements.

Free Advanced
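
To make two of the topics named in that guide more concrete, rate limiting and user agent rotation, here is a minimal sketch using only the Python standard library. It is not taken from the guide and is not specific to Amazon. The User-Agent strings, the example.com URLs, the RequestPlanner class, and the delay values are all hypothetical, and the code only builds requests without sending them.

```python
import itertools
import random
import urllib.request

# Hypothetical pool of User-Agent strings; a real project would keep these current.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_0) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]


class RequestPlanner:
    """Builds requests with a rotating User-Agent and a jittered delay."""

    def __init__(self, user_agents, base_delay=2.0, jitter=1.0, seed=None):
        self._agents = itertools.cycle(user_agents)  # round-robin rotation
        self._base_delay = base_delay
        self._jitter = jitter
        self._rng = random.Random(seed)

    def next_request(self, url):
        """Return a urllib Request carrying the next User-Agent in the pool."""
        return urllib.request.Request(url, headers={"User-Agent": next(self._agents)})

    def next_delay(self):
        """Seconds to wait before the next request: base delay plus random jitter."""
        return self._base_delay + self._rng.uniform(0, self._jitter)


planner = RequestPlanner(USER_AGENTS)
for path in ("/a", "/b", "/c", "/d"):
    req = planner.next_request("https://example.com" + path)
    # urllib stores header names capitalized, hence "User-agent" here.
    print(req.full_url, "|", req.get_header("User-agent"))
```

In a real fetch loop you would call time.sleep(planner.next_delay()) between requests and pass each request to urllib.request.urlopen. The jitter keeps the request timing from being perfectly regular, and the delay keeps the load on the target site modest.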

Playwright vs AI Scraping — Comparison

When to use Playwright (DIY) vs an AI scraping API. Covers performance, cost, maintenance burden, and use case fit. Honest comparison including where each approach fails.

Free Comparison

Best Web Scraping APIs in 2026

Side-by-side comparison of the top web scraping APIs. Covers pricing, features, use cases, and limitations. Updated for 2026 with AI-native options included.

Free 2026

Tools & Utilities

Helpers for building scrapers

HTTPBin

Free HTTP request & response testing service. Invaluable for testing headers, cookies, user agents, and proxy configurations during scraper development. No signup required.

Free Testing
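
One way to use a service like this during scraper development is to send the headers you intend to use and compare them with what the service reports receiving. The sketch below assumes the public httpbin.org instance and its /headers endpoint, which echoes request headers back as JSON under a "headers" key. The helper names build_probe, fetch_echo, and missing_or_changed are hypothetical, and only fetch_echo touches the network.

```python
import json
import urllib.request

# Assumed endpoint: the public HTTPBin instance echoes request headers as JSON.
HTTPBIN_HEADERS_URL = "https://httpbin.org/headers"


def build_probe(headers):
    """Create (but do not send) a request carrying the headers under test."""
    return urllib.request.Request(HTTPBIN_HEADERS_URL, headers=headers)


def fetch_echo(request, timeout=10):
    """Send the probe and return the headers the service says it received."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["headers"]


def missing_or_changed(sent, echoed):
    """Return the sent headers that were not echoed back with the same value.

    Header names are compared case-insensitively.
    """
    echoed_lower = {name.lower(): value for name, value in echoed.items()}
    return {
        name: value
        for name, value in sent.items()
        if echoed_lower.get(name.lower()) != value
    }
```

A typical check would be missing_or_changed(sent, fetch_echo(build_probe(sent))), where an empty dict means every header arrived as sent. Routing the same probe through a proxy is a quick way to see whether the proxy strips or rewrites headers.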

jq

Command-line JSON processor. Essential for working with scraped JSON data — filter, transform, and format JSON from the terminal. Pairs well with cURL-based scraping workflows.

Open Source CLI

Papalily Examples Repo

Ready-to-run code examples for the Papalily API in Node.js, Python, PHP, and cURL. Covers basic scraping, batch requests, e-commerce, job listings, and more. Free to copy and use.

Free AI Scraping

Papalily AI Docs (llms-full.txt)

Complete API documentation in LLM-friendly plain text format. Useful for feeding into AI assistants (ChatGPT, Claude, Copilot) to get accurate code suggestions for Papalily integration.

Free AI-Friendly

Skip the setup. Get structured data.

Papalily handles the browser, the rendering, and the extraction. You just describe what you want.
