Major new capability: interactive browser automation. Papalily can now execute JavaScript, fill forms, click buttons, paginate, and maintain browser sessions across multiple API calls. Includes a natural language task planner that converts plain-English goals into executable steps automatically.
New Endpoints
- POST /interact — Execute a sequence of interactive steps on a real browser page. Accepts `steps` (explicit) or `task` (natural language, AI-planned).
- POST /session/start — Open a persistent browser session. Context stays alive between API calls. Available on Pro and above.
- POST /session/:id/step — Execute one step or a natural-language task on a live session.
- GET /session/:id/state — Get current URL, title, and screenshot of a live session.
- DELETE /session/:id — Close a session and free browser resources.
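The session endpoints above compose into a simple lifecycle: start, step, inspect, close. A minimal client sketch, assuming a fetch-style HTTP client and a `session_id` field in the start response (the base URL, field names, and omitted auth headers are illustrative, not the documented contract):

```javascript
// Sketch of the persistent-session flow (Pro plans and above).
// BASE, the JSON field names, and the missing API-key header are
// assumptions for illustration -- check the live docs for the real contract.
const BASE = 'https://api.papalily.com';

async function runSessionTask(task, fetchImpl = fetch) {
  const headers = { 'Content-Type': 'application/json' };

  // 1. Open a persistent browser session.
  const start = await fetchImpl(`${BASE}/session/start`, { method: 'POST', headers });
  const { session_id } = await start.json();

  // 2. Execute a natural-language task on the live session.
  await fetchImpl(`${BASE}/session/${session_id}/step`, {
    method: 'POST', headers, body: JSON.stringify({ task }),
  });

  // 3. Read back the current URL, title, and screenshot.
  const state = await fetchImpl(`${BASE}/session/${session_id}/state`);
  const snapshot = await state.json();

  // 4. Always free the browser resources when done.
  await fetchImpl(`${BASE}/session/${session_id}`, { method: 'DELETE' });
  return snapshot;
}
```

Injecting `fetchImpl` keeps the sketch testable without a live server.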
New Features
- Natural language task planner — Pass `"task": "..."` to /interact or /session/:id/step. The AI snapshots the live page, generates a step plan, and executes it automatically. Plans are cached per domain+task for 1 hour.
- CSS schema extraction — New `css_schema` step type extracts structured data via CSS selectors with zero AI cost. Faster and cheaper than `extract` when page structure is known.
- Per-request browser contexts — Each request now gets an isolated Playwright context, preventing cookie/state leaks between users.
- Crash guards — `unhandledRejection` and `uncaughtException` handlers prevent full process crashes on async errors.
- PM2 memory restart — Server auto-restarts at 1GB memory to prevent OOM crashes.
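The crash guards amount to two process-level handlers in Node. A minimal sketch (the logging destination and recovery policy are illustrative):

```javascript
// Crash guards: catch stray async errors instead of letting them
// terminate the whole Node process. What you log, and whether you
// eventually exit(1) and let PM2 restart, is a policy choice.
process.on('unhandledRejection', (reason) => {
  console.error('[crash-guard] unhandled rejection:', reason);
});

process.on('uncaughtException', (err) => {
  console.error('[crash-guard] uncaught exception:', err);
  // For truly unrecoverable state, flush logs then process.exit(1)
  // and let the process manager restart cleanly.
});
```

The 1GB memory restart corresponds to PM2's `max_memory_restart: '1G'` setting in an ecosystem file.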
Fixes
- Fixed rate limiter `ERR_ERL_UNEXPECTED_X_FORWARDED_FOR` false-alarm errors flooding the log.
- Fixed mobile hamburger menu not working on Blog, Compare, and Resources pages.
- Fixed mobile nav font size inconsistency across pages.
- Standardised navbar links across all pages (Home | Docs | Pricing | Blog | Compare | Changelog | Resources).
Removed the /batch endpoint to protect server stability. Use POST /scrape for each URL individually — either sequentially or in parallel from your own client code.
Breaking Changes
- POST /batch removed — endpoint now returns `410 Gone`. The batch endpoint spawned multiple concurrent Playwright browser instances, causing high memory pressure. Replace with individual calls to POST /scrape.
Migration
- For sequential scraping: call `POST /scrape` in a loop
- For parallel scraping: call `POST /scrape` concurrently from your client (e.g. `Promise.all` in JS, `asyncio.gather` in Python)
- Cached results return instantly and are quota-free — repeated URLs benefit automatically
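For the parallel path, a bare `Promise.all` over every URL at once can recreate the load spike that got /batch removed. A client-side sketch with a small concurrency cap (the cap value and the `scrapeOne` callback are yours to supply; nothing here is part of the API):

```javascript
// Migration sketch: scrape many URLs with a bounded number of in-flight
// requests. scrapeOne is a stand-in for your own POST /scrape call.
async function scrapeAll(urls, scrapeOne, limit = 3) {
  const results = new Array(urls.length);
  let next = 0;

  // Each worker pulls the next unclaimed index until the list is drained.
  async function worker() {
    while (next < urls.length) {
      const i = next++;
      results[i] = await scrapeOne(urls[i]);
    }
  }

  const workers = Array.from({ length: Math.min(limit, urls.length) }, worker);
  await Promise.all(workers);
  return results; // same order as the input urls
}
```

Repeated URLs still hit the server-side cache, so re-running a list is cheap.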
Major rendering reliability update. The API now uses an adaptive content-stability algorithm
to detect when React/Vue/Next.js pages have finished hydrating — replacing fixed wait timers.
Full-page screenshots, proxy support, and automatic non-English translation complete the release.
New Features
- proxy_url parameter — route the browser through any HTTP/HTTPS/SOCKS5 proxy for geo-specific content (e.g. get USD pricing from a US IP)
- Adaptive content-stability wait — polls `innerText` length every 600ms and exits only when the page stops changing, instead of a fixed delay
- Lazy-load trigger — automatically scrolls the full page before capture to trigger intersection-observer lazy-loaded components
- Auto-translation — results containing Korean, Japanese, Chinese, or Arabic are automatically translated to English via a second AI pass
- API endpoint request logging — every call to every route is logged with status code and response time
- Analytics dashboard endpoint stats — new table shows total calls, success rate, avg response time, and error count per endpoint
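The adaptive wait above can be sketched as a small polling loop. This is a reconstruction from the description (poll `innerText` length every 600ms, exit when stable), not Papalily's actual implementation; the stability threshold and timeout are assumptions:

```javascript
// Adaptive content-stability wait: poll a length metric and return once
// it stops changing for `stableRounds` consecutive polls. readLength is
// injected (e.g. () => page.evaluate(() => document.body.innerText.length))
// so the loop is also testable outside a browser.
async function waitForStableContent(readLength, {
  intervalMs = 600, stableRounds = 2, maxWaitMs = 15000,
} = {}) {
  const sleep = (ms) => new Promise((r) => setTimeout(r, ms));
  let last = -1;
  let stable = 0;
  const deadline = Date.now() + maxWaitMs;

  while (Date.now() < deadline) {
    const len = await readLength();
    stable = (len === last) ? stable + 1 : 0;
    if (stable >= stableRounds) return len; // page stopped changing
    last = len;
    await sleep(intervalMs);
  }
  return last; // timed out: proceed with whatever we have
}
```

On a hydrating React/Vue page the length climbs while components mount, then plateaus; the loop exits on the plateau instead of waiting a fixed interval.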
Performance
- Page load: `networkidle` → `load` event — saves 1–3s per request on most sites
- Resource blocking: images, fonts, media, and tracking scripts aborted during navigation
- Browser context reuse — shared context kept alive between requests instead of recreating
- Screenshot quality optimised: full page up to 5000px tall at quality 70 for richer AI analysis
- HTML preprocessed before AI analysis — `<script>`, `<style>`, SVG, and comments stripped
- 15 extra Chrome `--disable-*` flags for unused browser services
- Gemini model instance reused at module level (no re-initialisation per request)
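Resource blocking splits naturally into a pure predicate plus Playwright route interception. A sketch, with an illustrative tracker list (not Papalily's actual list):

```javascript
// Abort images, fonts, media, and known tracking scripts during
// navigation. The blocked types come from the notes above; the tracker
// host list is an illustrative assumption.
const BLOCKED_TYPES = new Set(['image', 'font', 'media']);
const TRACKER_HOSTS = ['google-analytics.com', 'googletagmanager.com', 'doubleclick.net'];

function shouldBlock(resourceType, url) {
  if (BLOCKED_TYPES.has(resourceType)) return true;
  return TRACKER_HOSTS.some((host) => url.includes(host));
}

// Wiring into Playwright (not executed here):
// await context.route('**/*', (route) =>
//   shouldBlock(route.request().resourceType(), route.request().url())
//     ? route.abort()
//     : route.continue());
```

Keeping the predicate pure makes the blocking policy unit-testable without launching a browser.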
Bug Fixes
- Fixed geo-targeted sites (Shopify, etc.) serving localised content to non-US server IPs — added the `locale: en-US` browser context option plus `Accept-Language` and `CF-IPCountry` headers
- Fixed Korean/CJK text leaking into extraction results — AI extraction prompt now enforces English; translation safety net as backup
- Cookies cleared between requests — prevented stale geo-targeting cookies from affecting subsequent scrapes
- Removed `responseMimeType: application/json`, which was suppressing Gemini's language instruction-following
- Fixed duplicate route handlers causing request conflicts
- Analytics dashboard: bar charts now use %-based widths (was fixed 300px, broke on mobile)
- Analytics mobile: Referrer/IP/Device ID columns hidden on mobile; all tables wrapped in `overflow-x: auto`
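The geo-targeting fix above boils down to a US-English browser context configuration. A sketch of what such a context might look like in Playwright terms (the exact `Accept-Language` value is an assumption):

```javascript
// Force a US-English rendering context so geo-targeted sites serve the
// same content they would to a US visitor. Values are illustrative.
const usContext = {
  locale: 'en-US',
  extraHTTPHeaders: {
    'Accept-Language': 'en-US,en;q=0.9',
    'CF-IPCountry': 'US',
  },
};

// Playwright usage (not executed here):
// const context = await browser.newContext(usContext);
```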
Infrastructure
- Vision-first extraction — system prompt explicitly instructs AI to study screenshot for pricing cards, grids, and visual tables
- Geo-redirect interception — path-based locale redirects (`/ko/`, `/ja/`, etc.) rewritten to `/en/`
- Proxy requests use isolated one-time browser contexts — no shared state bleed
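The geo-redirect rewrite is essentially a path-prefix substitution. A sketch, with an illustrative locale list (the real set of intercepted prefixes isn't specified here):

```javascript
// Rewrite path-based locale prefixes such as /ko/ or /ja/ to /en/.
// The locale list is an illustrative assumption.
const LOCALE_PREFIX = /^\/(ko|ja|zh|ar|fr|de|es)(\/|$)/;

function rewriteLocalePath(url) {
  const u = new URL(url);
  u.pathname = u.pathname.replace(LOCALE_PREFIX, '/en$2');
  return u.toString();
}
```

Anchoring the match at the path start avoids false positives like `/kohl/` or `/blog/ko/`.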
Security hardening, analytics v2, self-hosted tracking dashboard, and mobile optimisation.
New Features
- Self-hosted analytics dashboard at `/analytics` — pageviews, click events, top pages, referrers
- Analytics v2: real IP tracking, device ID (`__ppid` localStorage UUID), browser fingerprint hash
- Scheduled blog publishing system — 8 SEO posts drip-fed 2x/week via cron job
- GEO/AI optimisation — `ai-page.html` served to AI crawlers (GPTBot, ClaudeBot, PerplexityBot) for ChatGPT Search and Perplexity extraction
- Comparison pages: `/compare/` hub, vs ScraperAPI, vs Apify
- Resources page with curated developer tools and backlinks
Performance & Security
- Nginx rate limiting zones: per-endpoint limits (scrape 10r/m, batch 3r/m, general 30r/m)
- HSTS, CSP, gzip, static asset caching on www
- `MAX_CONCURRENT_SCRAPES = 3` cap to prevent OOM on concurrent Playwright instances
- `trust proxy 1` set for correct client IP behind Nginx
- Mobile: orbs disabled on screens <768px (removed heavy `filter: blur` GPU load)
- Mobile navigation added to all pages (was completely missing)
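The `MAX_CONCURRENT_SCRAPES` cap is a classic promise semaphore: work beyond the limit queues until a slot frees. A sketch (the implementation details are illustrative, not the server's actual code):

```javascript
// Tiny promise semaphore: at most `max` jobs run at once; the rest wait
// in FIFO order. Used here to keep Playwright browser jobs to 3.
function createLimiter(max) {
  let active = 0;
  const queue = [];

  const release = () => {
    active--;
    if (queue.length) queue.shift()(); // wake the next waiter
  };

  return async function run(job) {
    if (active >= max) await new Promise((resolve) => queue.push(resolve));
    active++;
    try {
      return await job();
    } finally {
      release(); // always free the slot, even if the job threw
    }
  };
}

// const limitScrapes = createLimiter(3);
// app.post('/scrape', (req, res) => limitScrapes(() => doScrape(req, res)));
```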
Bug Fixes
- Fixed `css/` and `js/` directories with 700 permissions — Nginx could not read static assets
- Cache now never stores failed scrapes — subsequent requests always retry fresh
- Batch: each URL in batch counts as 1 quota request; cache hits are free
- RapidAPI plan limits correctly read from `x-rapidapi-subscription` header on every request
RapidAPI integration, CORS, improved cache logic, and batch endpoint hardening.
New Features
- RapidAPI proxy secret validation (`x-rapidapi-proxy-secret` header)
- CORS headers added for browser-side API access
- SEO/GEO meta — robots.txt, sitemap.xml, llms.txt, IndexNow key, JSON-LD schemas
- Blog launched — first post on scraping React sites with AI
- GitHub profile README as DA96 backlink
Improvements
- Cache improved: max 500 entries, LRU eviction, never caches failure responses
- Batch: pre-flight quota check before scraping; per-item quota counting
- RapidAPI plan auto-sync: `BASIC→50, PRO→1000, ULTRA→20000, MEGA→100000` requests
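The improved cache's three rules (500-entry cap, LRU eviction, never cache failures) fit in a few lines using a `Map`'s insertion order. A sketch, assuming results carry a `success` flag (that field name is an assumption):

```javascript
// LRU cache sketch: at most `maxEntries` items, least-recently-used
// evicted first, and failed scrape results are never stored.
class ScrapeCache {
  constructor(maxEntries = 500) {
    this.max = maxEntries;
    this.map = new Map(); // Map preserves insertion order: oldest first
  }

  get(url) {
    if (!this.map.has(url)) return undefined;
    const value = this.map.get(url);
    this.map.delete(url);   // re-insert to mark as most recently used
    this.map.set(url, value);
    return value;
  }

  set(url, result) {
    if (!result || result.success === false) return; // never cache failures
    if (this.map.has(url)) this.map.delete(url);
    this.map.set(url, result);
    if (this.map.size > this.max) {
      this.map.delete(this.map.keys().next().value); // evict oldest entry
    }
  }
}
```

Refusing to store failures means a transient error never poisons the cache: the next request for that URL always retries fresh.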
Papalily API launches publicly on RapidAPI. Chromium-based scraping with Gemini AI extraction.
Initial Features
- POST `/scrape` — render any URL in Chromium + extract structured JSON via Gemini AI
- POST `/batch` — scrape up to 5 URLs in parallel
- GET `/usage` — check quota and plan limits
- GET `/health` — API status and cache stats
- 10-minute LRU result cache — repeated requests instant and quota-free
- Playwright Chromium headless rendering — handles React, Vue, Next.js, Angular
- Gemini 2.0 Flash AI extraction engine — screenshot + text for maximum accuracy
- Let's Encrypt SSL on all three domains (www, bare, api)
- PM2 process manager with systemd auto-restart
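A minimal `/scrape` client sketch to show the core call. The base URL, the `prompt` body field, the response shape, and the omitted API-key header are assumptions for illustration, not the documented contract:

```javascript
// Render a URL and extract structured JSON via POST /scrape.
// Field names and auth are illustrative; consult the live docs.
const BASE = 'https://api.papalily.com';

async function scrape(url, prompt, fetchImpl = fetch) {
  const res = await fetchImpl(`${BASE}/scrape`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url, prompt }),
  });
  if (!res.ok) throw new Error(`scrape failed: ${res.status}`);
  return res.json();
}
```

Because of the 10-minute cache, calling `scrape` twice with the same URL returns the second result instantly and without spending quota.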