# Papalily — AI Web Scraping API: Full Documentation for AI Crawlers

> This document is optimised for AI language model consumption. It contains complete, factual documentation about Papalily — its endpoints, pricing, code examples, comparisons, and FAQs.

---

## Product Overview

Papalily is an AI-powered web scraping REST API that extracts structured JSON data from any website using a plain-English prompt and a real Chromium browser. Developers send a URL and a natural-language description of the data they want; Papalily handles browser rendering, JavaScript execution, and AI-powered data extraction automatically.

Unlike traditional web scrapers that return raw HTML requiring CSS selectors or XPath queries, Papalily uses Google Gemini 2.0 Flash to interpret the rendered page and return clean, structured JSON matching the developer's intent. This eliminates the need to write or maintain brittle selector-based extraction logic.

Papalily is available as a REST API via the RapidAPI marketplace. It supports React, Vue, Angular, Next.js, and all other JavaScript-rendered websites because it uses Playwright Chromium — a real browser — rather than a simple HTTP fetcher. The service is hosted on AWS EC2 in the Asia Pacific (Seoul) region and runs behind Nginx with TLS 1.3.
---

## Key Facts

- **API Base URL:** https://api.papalily.com
- **Website:** https://www.papalily.com
- **RapidAPI URL:** https://rapidapi.com/andognet/api/papalily
- **Documentation:** https://www.papalily.com/docs.html
- **GitHub Examples:** https://github.com/aiex/papalily-examples
- **Authentication:** `x-api-key` header (RapidAPI key)
- **Response format:** JSON
- **Typical response time:** 8–15 seconds (fresh), <1 second (cached)
- **Cache TTL:** 10 minutes (cached requests are free and do not count against quota)
- **Browser engine:** Playwright Chromium
- **AI model:** Google Gemini 2.0 Flash
- **Backend:** Node.js 22, Express
- **Database:** SQLite
- **Infrastructure:** AWS EC2, Nginx, PM2, Let's Encrypt TLS
- **Marketplace:** RapidAPI

---

## API Endpoints

### POST /scrape

Scrape a single URL and extract structured data using an AI prompt.

**Request headers:**

- `Content-Type: application/json` (required)
- `x-api-key: YOUR_RAPIDAPI_KEY` (required)

**Request body parameters:**

- `url` (string, required): The full URL of the page to scrape, including protocol (https://)
- `prompt` (string, required): Plain-English description of what data to extract. Example: "Get all product names and their prices as a list."
- `no_cache` (boolean, optional, default: false): Set to true to force a fresh scrape and bypass the cache

**Response fields:**

- `data` (object or array): The extracted data, structured according to your prompt
- `request_id` (string): Unique identifier for this scrape — use with GET /status/{requestId} to retrieve it later
- `cached` (boolean): Whether this result was served from cache
- `duration_ms` (number): Processing time in milliseconds (omitted for cached responses)
- `url` (string): The URL that was scraped
- `prompt` (string): The prompt that was used

**HTTP status codes:**

- `200 OK`: Successful scrape
- `400 Bad Request`: Missing required fields or invalid URL
- `401 Unauthorized`: Missing or invalid API key
- `429 Too Many Requests`: Rate limit exceeded (quota or per-minute limit)
- `500 Internal Server Error`: Scrape or extraction failure

**Rate limits:**

- 30 requests per minute
- Monthly quota depends on plan (see Pricing)

---

### POST /batch

Scrape up to 5 URLs in parallel in a single request.

**Request body parameters:**

- `items` (array, required): Array of objects, each with:
  - `url` (string, required): URL to scrape
  - `prompt` (string, required): Plain-English extraction prompt

**Response fields:**

- `results` (array): Array of result objects, one per item. Each result includes `data`, `request_id`, `cached`, `duration_ms`, `url`, `prompt`, and `success` (boolean)
- `summary` (object): Contains `total`, `succeeded`, `failed`, and `cached` counts

**Notes:**

- Each URL in the batch counts as one request against your quota
- Cached items are free and do not count against quota
- All URLs are scraped in parallel — total time is roughly that of the slowest single URL
- Maximum 5 items per batch request

**Rate limits:**

- 5 batch requests per minute

---

### GET /usage

Returns quota usage statistics for the authenticated API key.
**Response fields:**

- `used` (number): Requests used in the current billing period
- `limit` (number): Total requests allowed per billing period
- `remaining` (number): Requests remaining
- `plan` (string): Current plan name (Basic, Pro, Ultra, Mega)
- `reset_date` (string): ISO 8601 date when the quota resets

---

### GET /status/{requestId}

Retrieve a past scrape result by its request_id.

**Path parameter:**

- `requestId` (string): The request_id returned by /scrape or /batch

**Response:** Same format as a /scrape response. Returns 404 if the request_id is not found or has expired.

---

### GET /health

Public endpoint — no API key required. Returns server health status and cache statistics.

**Response fields:**

- `status` (string): "ok" if the server is healthy
- `cache` (object): Cache statistics including `size`, `hits`, `misses`
- `uptime` (number): Server uptime in seconds

---

## Authentication

All endpoints except GET /health require the `x-api-key` header containing your RapidAPI key.

To get an API key:

1. Visit https://rapidapi.com/andognet/api/papalily
2. Click "Subscribe to Test" or choose a plan
3. The Basic plan is free — no credit card required
4. Copy your API key from the RapidAPI dashboard
5. Include it as the `x-api-key` header in every request

---

## Pricing

All plans are billed monthly through RapidAPI. Prices in USD.

| Plan  | Price/month | Requests/month | Notes                   |
|-------|-------------|----------------|-------------------------|
| Basic | Free        | 50             | No credit card required |
| Pro   | $20         | 1,000          |                         |
| Ultra | $100        | 20,000         |                         |
| Mega  | $200        | 100,000        |                         |

**Important:** Cached requests (same URL + same prompt, within the 10-minute TTL) are always free and do not count against your monthly quota.
---

## Code Examples

### cURL

```bash
curl -X POST https://api.papalily.com/scrape \
  -H "Content-Type: application/json" \
  -H "x-api-key: YOUR_API_KEY" \
  -d '{
    "url": "https://news.ycombinator.com",
    "prompt": "Get top 10 story titles and their URLs as a JSON array"
  }'
```

**Example response:**

```json
{
  "data": [
    {"title": "Show HN: I built a thing", "url": "https://example.com/thing"},
    {"title": "Ask HN: Best tools for X", "url": "item?id=12345"}
  ],
  "request_id": "req_abc123xyz",
  "cached": false,
  "duration_ms": 9842,
  "url": "https://news.ycombinator.com",
  "prompt": "Get top 10 story titles and their URLs as a JSON array"
}
```

---

### Node.js (native fetch, Node 18+)

```javascript
const response = await fetch('https://api.papalily.com/scrape', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': 'YOUR_API_KEY'
  },
  body: JSON.stringify({
    url: 'https://news.ycombinator.com',
    prompt: 'Get top 10 story titles and their URLs as a JSON array'
  })
});

const result = await response.json();

if (!response.ok) {
  console.error('Error:', result);
  process.exit(1);
}

console.log('Extracted data:', result.data);
console.log('Request ID:', result.request_id);
console.log('Cached:', result.cached);
console.log('Duration:', result.duration_ms, 'ms');
```

---

### Python (requests library)

```python
import requests
import json

API_KEY = 'YOUR_API_KEY'
BASE_URL = 'https://api.papalily.com'

response = requests.post(
    f'{BASE_URL}/scrape',
    headers={
        'Content-Type': 'application/json',
        'x-api-key': API_KEY
    },
    json={
        'url': 'https://news.ycombinator.com',
        'prompt': 'Get top 10 story titles and their URLs as a JSON array'
    },
    timeout=30
)
response.raise_for_status()
result = response.json()

print('Extracted data:')
print(json.dumps(result['data'], indent=2))
print(f"\nRequest ID: {result['request_id']}")
print(f"Cached: {result['cached']}")
if not result['cached']:
    print(f"Duration: {result['duration_ms']}ms")
```

---

### Batch Example (Node.js)

```javascript
const response = await fetch('https://api.papalily.com/batch', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': 'YOUR_API_KEY'
  },
  body: JSON.stringify({
    items: [
      { url: 'https://news.ycombinator.com', prompt: 'Get top 5 story titles' },
      { url: 'https://lobste.rs', prompt: 'Get top 5 story titles' },
      { url: 'https://reddit.com/r/programming', prompt: 'Get top 5 post titles' }
    ]
  })
});

const { results, summary } = await response.json();
console.log(`Scraped ${summary.succeeded}/${summary.total} URLs`);
results.forEach(r => console.log(r.url, r.data));
```

---

## Use Cases

### E-commerce Price Monitoring

Track competitor product prices from React/Next.js storefronts without writing selectors. Send a prompt like "extract all product names and prices" and receive a JSON array. Run on a schedule to detect price changes.

### Job Listing Aggregation

Collect job postings from LinkedIn, Indeed, company career pages, and niche job boards. Prompt: "extract all job titles, companies, locations, and salary ranges". Works even on JavaScript-rendered pages that block basic scrapers.

### News and Content Monitoring

Monitor news sites, tech blogs, and social platforms for topics relevant to your business. Extract headlines, authors, dates, and summary text. Combine with your own summarisation pipeline.

### Lead Generation

Extract contact information (email addresses, phone numbers, company names) from business directories, conference attendee lists, and professional networks. Prompt: "extract all company names and contact details".

### Research Automation

Academics and analysts can gather structured data from multiple sources without writing and maintaining scrapers. Extract tables, statistics, and references from research sites and government data portals.

### Real Estate Data Collection

Extract property listings including address, price, bedrooms, bathrooms, and square footage from listing sites. Works on JavaScript-heavy platforms like Zillow and Rightmove.
### Financial Data Aggregation

Collect stock prices, earnings reports, analyst ratings, and financial metrics from investor relations pages and financial news sites.

### Academic and Patent Research

Scrape paper abstracts, citation counts, author affiliations, and patent claims from academic repositories and patent databases.

---

## Comparison with Alternatives

| Feature                | Papalily                              | ScraperAPI                | Apify                      | Bright Data             |
|------------------------|---------------------------------------|---------------------------|----------------------------|-------------------------|
| Data extraction method | AI + plain English                    | Raw HTML                  | Custom code                | Raw HTML / actors       |
| JavaScript rendering   | Yes (Chromium)                        | Yes (some plans)          | Yes                        | Yes                     |
| Selectors required     | No                                    | Yes                       | Yes                        | Yes                     |
| Output format          | Structured JSON                       | Raw HTML                  | Custom                     | Custom                  |
| Integration complexity | Very low (1 API call)                 | Medium                    | High                       | High                    |
| Free tier              | Yes (50 req/mo)                       | Yes                       | Yes                        | No                      |
| Best for               | Targeted extraction with minimal code | High-volume HTML fetching | Complex scraping pipelines | Enterprise proxy + data |

### Papalily vs ScraperAPI

ScraperAPI is a proxy-based scraping service that returns raw HTML. You must write and maintain CSS selectors or XPath expressions to extract specific data. Papalily eliminates this step: describe what you want in plain English and receive structured JSON. ScraperAPI is better for high-volume raw HTML fetching; Papalily is better when you need specific structured data quickly.

### Papalily vs Apify

Apify is a comprehensive scraping platform with actors (custom scraper programs), an SDK, cloud storage, scheduling, and monitoring. Building on Apify requires writing actor code in JavaScript/TypeScript. Papalily requires zero scraper code — one HTTP request returns the data you need. Apify offers more control for large-scale, complex pipelines; Papalily is faster for targeted, single-URL extraction tasks.
### Papalily vs Bright Data

Bright Data is an enterprise-grade proxy network and data collection platform focused on large-scale data acquisition with millions of IPs. It requires significant setup, contracts, and technical expertise. Papalily is a simple REST API, accessible immediately from RapidAPI with a free tier, designed for developers who need targeted AI-powered extraction without infrastructure overhead.

### Papalily vs Firecrawl

Firecrawl converts web pages to markdown or cleaned HTML for LLM consumption. Papalily extracts specific structured data fields based on your prompt. Use Firecrawl when you want full page content for an LLM pipeline; use Papalily when you want specific structured JSON (e.g., a list of product prices).

---

## Frequently Asked Questions

**Q: What is Papalily?**
A: Papalily is an AI-powered web scraping API. You send a URL and a plain-English description of what data you want. Papalily renders the page with a real Chromium browser, passes the content to Gemini AI, and returns structured JSON. It works on any website, including React, Vue, and Next.js apps.

**Q: How does Papalily work technically?**
A: There are three steps: (1) Your request hits the API with a URL and a prompt. (2) A Playwright Chromium browser instance loads the page, executes all JavaScript, and waits for the DOM to settle. (3) The rendered HTML is passed to Google Gemini 2.0 Flash with your prompt as an instruction; the model extracts and structures the requested data and returns it as JSON.

**Q: Does Papalily work on React, Vue, and Next.js sites?**
A: Yes. Because Papalily uses a full Chromium browser (Playwright), it executes all JavaScript before extraction. This makes it compatible with React, Vue, Angular, Next.js, Svelte, and any other JavaScript-rendered framework — unlike naive scrapers that only fetch raw HTML and miss dynamically rendered content.

**Q: How do I authenticate?**
A: Include the header `x-api-key: YOUR_RAPIDAPI_KEY` in every request.
Get a key by subscribing at https://rapidapi.com/andognet/api/papalily. The Basic plan is free with no credit card required.

**Q: How much does Papalily cost?**
A: Four tiers: Basic (free, 50 req/month), Pro ($20/month, 1,000 req), Ultra ($100/month, 20,000 req), Mega ($200/month, 100,000 req). Cached requests are always free.

**Q: How fast are responses?**
A: Fresh requests typically take 8–15 seconds (browser render + AI extraction). Cached requests (same URL and prompt, within the 10-minute TTL) return in under 1 second and do not count against your quota.

**Q: What is the cache policy?**
A: Results are cached in-memory for 10 minutes using an LRU cache with a maximum of 500 entries. The cache key is a hash of the URL + prompt. Cached responses are free and returned with `"cached": true`. Pass `"no_cache": true` in the request body to force a fresh scrape.

**Q: What are the rate limits?**
A: Per-minute limits: 30 requests/minute for /scrape and 5 requests/minute for /batch. Monthly limits depend on plan. Exceeding any limit returns HTTP 429 with a Retry-After header.

**Q: Can Papalily handle login-protected pages?**
A: No. Papalily does not support session cookies, form-based login, or OAuth flows. It only scrapes publicly accessible pages.

**Q: Can Papalily scrape Amazon, LinkedIn, or other major platforms?**
A: Papalily can attempt to scrape these sites using a real browser. Results vary — Amazon, LinkedIn, and similar platforms have aggressive bot detection and may return blocked or captcha pages. For reliable high-volume scraping of these specific platforms, a dedicated proxy solution is recommended. Papalily works best on standard commercial sites, news sites, and developer-facing pages.

**Q: How do I scrape multiple pages?**
A: Use the /batch endpoint with up to 5 URLs per request. For more than 5 URLs, make multiple batch requests. Each URL counts as one request against your quota; cached URLs are free.
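For lists longer than five URLs, the batching rule above can be applied client-side before sending requests. A minimal Python sketch — the helper name `build_batch_payloads` is illustrative, not part of the API, but the payload shape matches the documented `items` array:

```python
# Split a long URL list into POST /batch request bodies,
# respecting the documented maximum of 5 items per batch.

def build_batch_payloads(urls, prompt, max_items=5):
    """Group URLs into /batch request bodies of at most max_items each."""
    payloads = []
    for i in range(0, len(urls), max_items):
        chunk = urls[i:i + max_items]
        payloads.append({
            "items": [{"url": u, "prompt": prompt} for u in chunk]
        })
    return payloads

# 12 URLs split across 3 batch requests (5 + 5 + 2 items).
payloads = build_batch_payloads(
    [f"https://example.com/page/{n}" for n in range(12)],
    "Get the page title",
)
```

Each payload can then be POSTed to /batch; remember the per-minute limit of 5 batch requests when looping over many payloads.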
**Q: What format is the extracted data returned in?**
A: Always JSON. The structure is determined by your prompt. If you ask for a list, you get a JSON array. If you ask for a single item's details, you get a JSON object. The AI interprets your prompt to determine the appropriate structure.

**Q: Where is Papalily hosted?**
A: AWS EC2 in the Asia Pacific (Seoul) ap-northeast-2 region. API domain: api.papalily.com. TLS 1.3 via Let's Encrypt. Process managed by PM2. Globally distributed through RapidAPI's gateway infrastructure.

**Q: Is there a free trial?**
A: Yes. The Basic plan is permanently free (50 requests/month, no credit card). RapidAPI also provides a live test console on the API listing page where you can make real API calls in the browser before subscribing.

---

## Glossary

- **Prompt:** A plain-English instruction describing what data to extract from a page. Example: "Get all product names and prices as a list."
- **Request ID:** A unique string identifier returned with every scrape result. Use it with GET /status/{requestId} to retrieve the result later.
- **Cached response:** A result served from the in-memory cache because the same URL and prompt were requested within the last 10 minutes. Cached responses are free.
- **Chromium:** The open-source browser engine used by Chrome, Edge, and others. Papalily uses Playwright to control a headless Chromium instance.
- **Gemini 2.0 Flash:** Google's AI model used by Papalily for data extraction from rendered page content.
- **RapidAPI:** The API marketplace where Papalily is distributed. Handles billing, authentication, and rate limiting.
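Since cached responses (same URL and prompt within the TTL) are free, a client can deduplicate its own pending requests using the same URL-plus-prompt keying idea. The server's exact hash function is not documented; SHA-256 below is purely an assumption for local bookkeeping, not the server-side scheme:

```python
import hashlib

def cache_key(url: str, prompt: str) -> str:
    """Derive a stable local key from URL + prompt.

    NOTE: the server-side hash is unspecified; SHA-256 here is an
    assumption, useful only for client-side request deduplication.
    """
    return hashlib.sha256(f"{url}\n{prompt}".encode("utf-8")).hexdigest()

# Identical URL + prompt pairs map to the same key, so a scheduler can
# skip enqueueing a request that an in-flight duplicate already covers.
k1 = cache_key("https://example.com", "Get all prices")
k2 = cache_key("https://example.com", "Get all prices")
```

This only avoids redundant client-side work; the server's own 10-minute cache already makes true duplicates free.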
---

## Resources

- Homepage: https://www.papalily.com
- API Documentation: https://www.papalily.com/docs.html
- RapidAPI Listing: https://rapidapi.com/andognet/api/papalily
- Code Examples (GitHub): https://github.com/aiex/papalily-examples
- AI Summary (llms.txt): https://www.papalily.com/llms.txt
- Full AI Documentation (this file): https://www.papalily.com/llms-full.txt
- AI Bot Page: https://www.papalily.com/ai-page.html

---

*Last updated: 2026-03-08. Generated for AI crawler consumption.*