Tools, libraries, guides, and APIs for extracting data from the web. Curated for developers who actually build scrapers.
Microsoft's end-to-end testing and automation library. Supports Chromium, Firefox, and WebKit. Excellent for scraping React/Vue/Angular apps that require JavaScript execution. Auto-waiting and network interception handle most modern web patterns.
Google's Node.js library for controlling Chrome/Chromium via the DevTools Protocol. The original headless Chrome scraping tool. Slightly lower-level than Playwright but extremely well-documented with a large community.
The original browser automation framework, widely used for testing and scraping. Supports all major browsers and multiple programming languages. More verbose than Playwright/Puppeteer but has the largest ecosystem and most Stack Overflow answers.
Skip the browser setup entirely. POST a URL + plain-English prompt, get back structured JSON. Papalily runs Playwright Chromium for you and uses Gemini AI to extract exactly what you asked for. Free tier available.
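A sketch of what the request shape might look like using only the standard library. The endpoint path and auth header here are assumptions, not confirmed values — check Papalily's API docs for the real ones:

```python
import json
import urllib.request

# Hypothetical endpoint — verify against Papalily's documentation.
API_URL = "https://api.papalily.com/v1/scrape"

def build_request(page_url: str, prompt: str, api_key: str) -> urllib.request.Request:
    """POST a URL plus a plain-English prompt; the API returns structured JSON."""
    payload = json.dumps({"url": page_url, "prompt": prompt}).encode()
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
        },
        method="POST",
    )

req = build_request(
    "https://example.com/products",
    "Extract each product's name and price as JSON",
    "YOUR_API_KEY",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch stays offline.
print(req.get_method())
```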
Python library for parsing HTML and XML. Extremely beginner-friendly with a simple, Pythonic API. Best for scraping static HTML — pair with Playwright or requests-html for JavaScript-rendered sites. The most-used Python scraping library.
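A minimal example of the static-HTML workflow: parse a snippet with the stdlib `html.parser` backend (pass `"lxml"` instead, if installed, for speed):

```python
from bs4 import BeautifulSoup

html = """
<ul class="books">
  <li><a href="/b/1">Dune</a> <span class="price">$9.99</span></li>
  <li><a href="/b/2">Hyperion</a> <span class="price">$7.49</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors via .select(); .get_text() and item access read text and attributes.
books = [
    {"title": a.get_text(), "href": a["href"]}
    for a in soup.select("ul.books li a")
]
print(books)
# → [{'title': 'Dune', 'href': '/b/1'}, {'title': 'Hyperion', 'href': '/b/2'}]
```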
Fast, flexible HTML parser for Node.js with a jQuery-like API. Ideal for parsing static HTML responses. Not a browser — use with Playwright or Puppeteer when JavaScript rendering is required. Very fast and memory-efficient.
High-performance XML and HTML parsing library for Python. Much faster than Beautiful Soup for large documents. Supports XPath and CSS selectors. The go-to choice when you need speed and are parsing large volumes of HTML.
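The same kind of extraction with lxml and XPath — on multi-megabyte documents this is typically far faster than a pure-Python parse:

```python
from lxml import html

doc = html.fromstring("""
<table id="quotes">
  <tr><td class="sym">AAPL</td><td class="px">189.30</td></tr>
  <tr><td class="sym">MSFT</td><td class="px">402.10</td></tr>
</table>
""")

# XPath pulls text nodes directly, no element iteration needed.
symbols = doc.xpath('//td[@class="sym"]/text()')
prices = [float(p) for p in doc.xpath('//td[@class="px"]/text()')]
print(symbols, prices)
```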
Java HTML parser with a CSS selector-based API. The standard choice for Java-based scraping. Handles malformed HTML gracefully. Good documentation and active maintenance.
Python's most powerful web scraping framework. Async architecture handles high volumes efficiently. Built-in support for pipelines, middleware, item processors, and export formats (JSON, CSV, XML). Best for large-scale structured crawls.
Modern Node.js web scraping library by Apify. Batteries-included: automatic retries, proxy rotation, session management, storage. Supports Playwright and Puppeteer as browser backends. Great developer experience.
Python library for automating interaction with websites. Emulates browser behavior (forms, links, cookies) without launching a real browser; good for form submission and simple navigation on sites that don't require JavaScript. Built on requests + Beautiful Soup.
Python library that adds JavaScript rendering (via pyppeteer) to the familiar requests API. Good middle ground for moderately JS-heavy sites. Easy to adopt if you already use requests.
Proxy-based scraping service that handles IP rotation, CAPTCHAs, and browser fingerprinting. Returns raw HTML — you still need to parse it with your own selectors. Good for high-volume raw HTML fetching.
Enterprise proxy network with residential, datacenter, and mobile IPs. One of the largest proxy providers. Also offers Scraping Browser (Playwright-compatible) and pre-built datasets. Best for large enterprise scraping.
Premium proxy and web data platform. Residential proxies, datacenter proxies, and a Scraper API. Strong reputation for reliability. Higher price point, aimed at enterprise use cases.
API-based scraping service with anti-bot bypass. Returns HTML or structured data. Developer-friendly pricing starting from $49/month. Good documentation and JS rendering support.
Comprehensive documentation covering installation, core concepts, API reference, and examples. The best starting point for learning Playwright for scraping or testing.
Official Scrapy tutorial walking you through building a full scraper from scratch. Covers spiders, selectors, pipelines, and deploying. The most complete introduction to structured crawling in Python.
Deep dive into why CSS selectors break on React/Vue sites and how AI-based extraction solves the selector maintenance problem. Includes code examples comparing Playwright DIY vs AI API.
Practical guide covering common Amazon anti-bot mechanisms, proxy strategies, and when to use API-based scraping instead of DIY. Covers rate limiting, user agent rotation, and JavaScript rendering requirements.
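One of the techniques the guide covers, user-agent rotation, can be sketched with only the standard library. The UA strings and the product URL below are illustrative placeholders; in production, rotate real, current browser strings and combine this with rate limiting and proxies:

```python
import random
import urllib.request

# Illustrative, abbreviated UA strings — substitute real, up-to-date ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def make_request(url: str) -> urllib.request.Request:
    """Attach a randomly chosen User-Agent header to each outgoing request."""
    return urllib.request.Request(
        url, headers={"User-Agent": random.choice(USER_AGENTS)}
    )

req = make_request("https://example.com/product")  # placeholder URL
print(req.get_header("User-agent"))
```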
When to use Playwright (DIY) vs an AI scraping API. Covers performance, cost, maintenance burden, and use case fit. Honest comparison including where each approach fails.
Side-by-side comparison of the top web scraping APIs. Covers pricing, features, use cases, and limitations. Updated for 2026 with AI-native options included.
Free HTTP request & response testing service. Invaluable for testing headers, cookies, user agents, and proxy configurations during scraper development. No signup required.
Command-line JSON processor. Essential for working with scraped JSON data — filter, transform, and format JSON from the terminal. Pairs well with cURL-based scraping workflows.
Ready-to-run code examples for the Papalily API in Node.js, Python, PHP, and cURL. Covers basic scraping, batch requests, e-commerce, job listings, and more. Free to copy and use.
Complete API documentation in LLM-friendly plain text format. Useful for feeding into AI assistants (ChatGPT, Claude, Copilot) to get accurate code suggestions for Papalily integration.
Papalily handles the browser, the rendering, and the extraction. You just describe what you want.