
Data Extraction from PDFs and Documents

📅 June 10, 2026 ⏱ 10 min read By Papalily Team

PDFs and documents contain a wealth of valuable information, but extracting that data programmatically presents unique challenges. Unlike structured web pages, documents come in various formats, layouts, and quality levels. From scanned invoices to multi-page reports, extracting meaningful data requires specialized techniques that combine optical character recognition (OCR), layout analysis, and intelligent parsing. This guide explores modern approaches to document data extraction and how AI is transforming this traditionally difficult task.

Why Document Data Extraction Matters

Organizations process millions of documents daily. Invoices, receipts, contracts, forms, and reports contain critical business data that traditionally required manual entry. Automated document extraction delivers significant benefits:

The global intelligent document processing market is projected to reach $5.2 billion by 2027, driven by demand for automated solutions that can handle the document deluge efficiently.

Types of Documents and Extraction Challenges

Different document types present distinct extraction challenges. Understanding these differences helps you choose the right approach:

Native PDFs vs. Scanned Documents

Native PDFs contain embedded text that can be extracted directly. These are generated from applications like Word, Excel, or design software. Extraction is straightforward using PDF parsing libraries.

Scanned documents are essentially images. The text is visual, not embedded, requiring OCR to recognize characters. Quality varies based on scan resolution, lighting, and document condition. OCR accuracy typically ranges from 85-99% depending on these factors.
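One way to route a file down the right path is to check how much embedded text a parser returns for it. The sketch below is a heuristic rather than an established rule: the function name `looksScanned` and the threshold of 50 characters per page are hypothetical choices for illustration, and the input is assumed to have the `text` and `numpages` fields that pdf-parse returns.

```javascript
// Heuristic: a PDF with almost no embedded text per page is probably a scan.
// `parsed` is assumed to look like pdf-parse output: { text, numpages }.
// The default threshold is an illustrative guess, not a tuned value.
function looksScanned(parsed, minCharsPerPage = 50) {
  const pages = Math.max(parsed.numpages || 1, 1);
  // Count only visible characters so whitespace-only pages do not pass
  const visibleChars = (parsed.text || '').replace(/\s+/g, '').length;
  return visibleChars / pages < minCharsPerPage;
}
```

A file flagged by this check can be sent to OCR, while everything else goes through direct text extraction. Documents that mix scanned and native pages would need a per-page version of the same test.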

Structured vs. Unstructured Documents

Structured documents like forms and invoices follow predictable layouts. Fields appear in consistent locations, making template-based extraction effective. These include tax forms, insurance claims, and standardized invoices.

Unstructured documents like contracts, reports, and letters have variable layouts. Information appears in different positions, requiring AI-powered extraction that can understand context and semantics rather than relying on position.

Traditional Extraction Methods

Before AI-powered solutions, document extraction relied on rule-based approaches. These methods still have their place for simple, consistent documents:

pdf-extraction-basic.js
// Extract text from native PDFs using pdf-parse
// (written against the pdf-parse 1.x API, where the module exports a single function)
const pdfParse = require('pdf-parse');
const fs = require('fs');

async function extractTextFromPDF(filePath) {
  // Read asynchronously so large files do not block the event loop
  const buffer = await fs.promises.readFile(filePath);
  const data = await pdfParse(buffer);

  return {
    text: data.text,
    pageCount: data.numpages,
    info: data.info
  };
}

// Simple regex-based extraction for structured data
function extractInvoiceData(text) {
  const patterns = {
    invoiceNumber: /Invoice\s*#?\s*[:\-]?\s*(\w+)/i,
    // The lookbehind keeps "Due Date" from being captured as the invoice date
    date: /(?<!Due\s*)\bDate\s*[:\-]?\s*(\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})/i,
    // \b keeps "Subtotal" from being captured as the total
    total: /\bTotal\s*[:\-]?\s*[$€£]?\s*([\d,]+\.?\d*)/i,
    dueDate: /Due\s*Date\s*[:\-]?\s*(\d{1,2}[\/\-\.]\d{1,2}[\/\-\.]\d{2,4})/i
  };

  const extracted = {};

  // Each field is null when its pattern does not match
  Object.entries(patterns).forEach(([field, pattern]) => {
    const match = text.match(pattern);
    extracted[field] = match ? match[1] : null;
  });

  return extracted;
}
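The regex patterns return raw strings such as "1,234.50" or "3/7/24", which usually need normalizing before they are stored. The helpers below are a minimal sketch of that step. The names `normalizeAmount` and `normalizeDate` are hypothetical, the default month-first ordering is an assumption that will be wrong for many locales, and two-digit years are assumed to mean 20xx.

```javascript
// Convert a captured amount like "1,234.50" to a number, or null if unusable.
function normalizeAmount(raw) {
  if (raw == null || String(raw).trim() === '') return null;
  const value = Number(String(raw).replace(/,/g, ''));
  return Number.isFinite(value) ? value : null;
}

// Convert a captured date like "3/7/24" to YYYY-MM-DD.
// `order` is 'MDY' (assumed default) or 'DMY'; the text alone cannot tell them apart.
function normalizeDate(raw, order = 'MDY') {
  if (raw == null) return null;
  const parts = String(raw).split(/[\/\-.]/).map(Number);
  if (parts.length !== 3 || parts.some(Number.isNaN)) return null;

  let [first, second, year] = parts;
  if (year < 100) year += 2000; // assumption: two-digit years are 20xx

  const month = order === 'MDY' ? first : second;
  const day = order === 'MDY' ? second : first;
  if (month < 1 || month > 12 || day < 1 || day > 31) return null;

  const pad = n => String(n).padStart(2, '0');
  return `${year}-${pad(month)}-${pad(day)}`;
}
```

Returning null for anything that fails validation keeps bad captures from flowing silently into downstream systems.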

Template-Based Extraction

For documents with consistent layouts, template-based extraction defines regions of interest. This approach works well for standardized forms and invoices from the same vendor:

template-extraction.js
// Define extraction zones for a specific invoice template.
// Zone values must use the same units the parser reports for text.x and text.y.
const invoiceTemplate = {
  vendor: { x: 50, y: 50, width: 200, height: 30 },
  invoiceNumber: { x: 400, y: 80, width: 150, height: 20 },
  invoiceDate: { x: 400, y: 105, width: 150, height: 20 },
  totalAmount: { x: 400, y: 700, width: 150, height: 25 },
  lineItems: { x: 50, y: 200, width: 500, height: 400 }
};

// Extract using coordinates (requires a PDF library with positional data)
const PDFParser = require('pdf2json');

async function extractWithTemplate(pdfPath, template) {
  const parser = new PDFParser();

  return new Promise((resolve, reject) => {
    parser.on('pdfParser_dataReady', (pdfData) => {
      const extracted = {};

      // Older pdf2json releases nest pages under formImage; newer ones expose Pages directly
      const pages = (pdfData.formImage || pdfData).Pages;

      // Iterate through pages and text elements
      pages.forEach(page => {
        page.Texts.forEach(text => {
          const { x, y } = text;

          // Check if the text falls within any template zone
          Object.entries(template).forEach(([field, zone]) => {
            if (x >= zone.x && x <= zone.x + zone.width &&
                y >= zone.y && y <= zone.y + zone.height) {
              // A text element can hold several runs; decode and join all of them
              const decodedText = text.R.map(run => decodeURIComponent(run.T)).join('');
              extracted[field] = (extracted[field] || '') + decodedText + ' ';
            }
          });
        });
      });

      resolve(extracted);
    });

    parser.on('pdfParser_dataError', reject);
    parser.loadPDF(pdfPath);
  });
}

OCR for Scanned Documents

When dealing with scanned documents or image-based PDFs, OCR is essential. Modern OCR engines have evolved significantly from simple character recognition to understanding document structure:

ocr-extraction.js
// Using Tesseract.js for OCR (it runs in Node.js as well as in the browser)
const Tesseract = require('tesseract.js');

async function extractFromImage(imagePath) {
  const result = await Tesseract.recognize(imagePath, 'eng', {
    logger: m => console.log(m)
  });

  return {
    text: result.data.text,
    confidence: result.data.confidence,
    // Word-level output depends on the Tesseract.js version and output options
    words: result.data.words
  };
}

// Convert PDF pages to images, then OCR.
// Uses the fromPath API of pdf2pic 2.x and later, which needs GraphicsMagick
// and Ghostscript installed; the ./temp directory should exist beforehand.
const { fromPath } = require('pdf2pic');

async function extractFromScannedPDF(pdfPath) {
  const convert = fromPath(pdfPath, {
    density: 300, // Higher DPI for better OCR accuracy
    saveFilename: 'page',
    savePath: './temp',
    format: 'png',
    width: 2480, // A4 at 300 DPI
    height: 3508
  });

  // -1 converts every page; each result includes the path of the saved image
  const images = await convert.bulk(-1);

  // OCR each page in order
  const results = [];
  for (const image of images) {
    const ocrResult = await extractFromImage(image.path);
    results.push(ocrResult);
  }

  return results;
}
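Because OCR accuracy varies with scan quality, the per-page confidence score is a natural signal for deciding which pages a person should check. The sketch below assumes an array shaped like the output of `extractFromScannedPDF` (objects with a `confidence` value on Tesseract's 0 to 100 scale). The name `summarizeOcrPages` and the default cutoff of 80 are hypothetical and would need tuning against real documents.

```javascript
// Flag OCR pages whose confidence falls below a cutoff.
// `pages` is assumed to be an array of { confidence } objects (0-100 scale).
function summarizeOcrPages(pages, minConfidence = 80) {
  return pages.map((page, index) => ({
    page: index + 1,
    confidence: page.confidence,
    // A missing or non-numeric confidence is treated as needing review
    needsReview: !(page.confidence >= minConfidence)
  }));
}
```

Pages marked `needsReview` can be queued for manual checking or rescanned at a higher resolution, while the rest continue through automated parsing.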

AI-Powered Document Extraction

Modern AI approaches have revolutionized document extraction. Large language models and specialized document AI can understand context, handle variable layouts, and extract complex relationships without templates:

Layout-Aware Extraction

AI models like LayoutLM and Donut understand both text content and spatial relationships. They can identify tables, headers, and sections, making them ideal for complex documents with mixed content:

ai-extraction.js
// Using cloud AI services for document extraction
const { DocumentProcessorServiceClient } = require('@google-cloud/documentai');
const fs = require('fs');

// projectId was undefined in the earlier version of this example; reading it from
// the GOOGLE_CLOUD_PROJECT environment variable by default is an assumption.
async function extractWithDocumentAI(filePath, processorId, projectId = process.env.GOOGLE_CLOUD_PROJECT) {
  const client = new DocumentProcessorServiceClient();
  const name = `projects/${projectId}/locations/us/processors/${processorId}`;

  const imageFile = await fs.promises.readFile(filePath);

  const request = {
    name,
    rawDocument: {
      content: imageFile.toString('base64'),
      mimeType: 'application/pdf',
    },
  };

  const [result] = await client.processDocument(request);
  const { document } = result;

  // Layout text is referenced by offsets into document.text, so resolve each anchor
  const getText = (textAnchor) =>
    ((textAnchor && textAnchor.textSegments) || [])
      .map(seg => document.text.substring(Number(seg.startIndex || 0), Number(seg.endIndex)))
      .join('');

  // Extract structured entities
  const entities = {};
  (document.entities || []).forEach(entity => {
    entities[entity.type] = {
      value: entity.mentionText,
      confidence: entity.confidence
    };
  });

  // Extract tables as arrays of rows, each row an array of cell strings
  const tables = [];
  (document.pages || []).forEach(page => {
    (page.tables || []).forEach(table => {
      const rows = (table.bodyRows || []).map(row =>
        row.cells.map(cell => getText(cell.layout.textAnchor).trim())
      );
      tables.push(rows);
    });
  });

  return { entities, tables, text: document.text };
}

LLM-Based Extraction

Large language models can extract information from documents using natural language instructions. This approach is flexible and works across document types without custom training:

llm-extraction.js
// Extract structured data using LLM (uses the global fetch available in Node.js 18+)
async function extractWithLLM(documentText, extractionSchema) {
  // Only the first 4,000 characters are sent; longer documents are truncated
  const prompt = `
Extract the following information from this document.
Return ONLY a JSON object matching this schema:
${JSON.stringify(extractionSchema, null, 2)}

Document content:
${documentText.slice(0, 4000)}

Extracted JSON:
`;

  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: 'gpt-4',
      messages: [
        {
          role: 'system',
          content: 'You are a document data extraction assistant. Extract structured data accurately from provided text.'
        },
        { role: 'user', content: prompt }
      ],
      temperature: 0.1
    })
  });

  // Surface HTTP errors instead of failing later on a missing choices array
  if (!response.ok) {
    throw new Error(`LLM request failed with status ${response.status}`);
  }

  const data = await response.json();
  const jsonString = data.choices[0].message.content;

  // Strip any Markdown code fences around the JSON, then parse it
  try {
    return JSON.parse(jsonString.replace(/```json\n?|```/g, ''));
  } catch (e) {
    console.error('Failed to parse LLM response:', jsonString);
    throw e;
  }
}

// Example usage
const schema = {
  invoiceNumber: "string - the invoice identifier",
  vendorName: "string - company issuing the invoice",
  invoiceDate: "string - date in YYYY-MM-DD format",
  totalAmount: "number - total amount due",
  lineItems: "array of objects with description, quantity, and price"
};
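A model can return valid JSON that still omits a field or uses the wrong type, so it helps to check the parsed result against the schema before trusting it. The sketch below assumes the descriptive schema style used in the example, where each description starts with a type word such as "string", "number", or "array". The name `validateExtraction` is hypothetical, and the check is deliberately shallow: it does not inspect the contents of arrays or verify date formats.

```javascript
// Compare an LLM result with a schema whose descriptions begin with a type word.
// Returns a list of human-readable problems; an empty list means no issues found.
function validateExtraction(result, schema) {
  const problems = [];

  for (const [field, description] of Object.entries(schema)) {
    const value = result[field];
    if (value === undefined || value === null) {
      problems.push(`${field}: missing`);
      continue;
    }

    // The first word of the description is taken as the expected type
    const expected = description.trim().split(/\s+/)[0].toLowerCase();
    const actual = Array.isArray(value) ? 'array' : typeof value;

    // Only the three type words used in the example schema are checked
    if (['string', 'number', 'array'].includes(expected) && actual !== expected) {
      problems.push(`${field}: expected ${expected}, got ${actual}`);
    }
  }

  return problems;
}
```

When the list is not empty, a caller might retry the request with the problems appended to the prompt, or route the document to manual review.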

Handling Tables in Documents

Tables are particularly challenging because they span multiple lines and require understanding row/column relationships. Specialized approaches yield better results:

table-extraction.js
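// Extract tables using Camelot (Python) or Tabula
// Node.js wrapper example:
const { PythonShell } = require('python-shell');
const fs = require('fs');
const path = require('path');

async function extractTables(pdfPath) {
  // The path is passed as an argument instead of being interpolated into the
  // Python source, so quotes or backslashes in it cannot break the script
  const script = `
import sys
import json
import camelot

tables = camelot.read_pdf(sys.argv[1], pages='all')
results = []
for table in tables:
    results.append({
        'page': table.page,
        'accuracy': table.accuracy,
        'data': table.df.to_dict('records')
    })
print(json.dumps(results))
`;

  const result = await PythonShell.runString(script, {
    mode: 'text',
    pythonPath: 'python3',
    args: [pdfPath]
  });

  // The JSON is the last line printed, after anything else written to stdout
  return JSON.parse(result[result.length - 1]);
}

// Alternative: Use cloud API for table extraction
async function extractTablesWithAPI(pdfPath) {
  // The built-in FormData in Node.js 18+ expects a Blob rather than a read stream
  const buffer = await fs.promises.readFile(pdfPath);
  const formData = new FormData();
  formData.append('file', new Blob([buffer], { type: 'application/pdf' }), path.basename(pdfPath));

  const response = await fetch('https://api.papalily.com/v1/extract/tables', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.PAPALILY_API_KEY}`
    },
    body: formData
  });

  if (!response.ok) {
    throw new Error(`Table extraction request failed with status ${response.status}`);
  }

  return response.json();
}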
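Once a table has been pulled out as rows of cells, the row/column relationship still has to be made explicit before the data is easy to use. The sketch below turns an array of rows into keyed records, assuming the first row holds the column headers, which is not true of every table. The name `rowsToRecords` and the `columnN` fallback for blank headers are hypothetical.

```javascript
// Convert [[header...], [cell...], ...] into an array of keyed objects.
// Assumes the first row is the header row.
function rowsToRecords(rows) {
  if (rows.length === 0) return [];

  const [header, ...body] = rows;
  // Blank header cells get a positional fallback name
  const keys = header.map((cell, i) => String(cell).trim() || `column${i + 1}`);

  return body.map(row =>
    Object.fromEntries(
      // Short rows are padded with null so every record has the same keys
      keys.map((key, i) => [key, row[i] !== undefined ? row[i] : null])
    )
  );
}
```

For example, `rowsToRecords([['Item', 'Qty'], ['Widget', '2']])` yields one record with `Item` and `Qty` keys, which maps directly onto a line-item structure.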

Best Practices for Document Extraction

To maximize extraction accuracy and reliability:

Choosing the Right Approach

The best extraction method depends on your specific requirements. Here's a quick decision guide:
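The distinctions drawn earlier in this guide (native versus scanned, structured versus unstructured, and whether tables are present) can be condensed into a small helper. This is only a sketch of that reasoning: `chooseApproach`, its three flags, and the step labels are hypothetical names, and real pipelines often combine approaches or weigh cost and volume as well.

```javascript
// Map document characteristics to the extraction steps described in this guide.
function chooseApproach({ scanned, structured, hasTables }) {
  const steps = [];

  // Scanned documents need OCR first; native PDFs can be parsed directly
  steps.push(scanned ? 'ocr' : 'pdf-text-parsing');

  // Predictable layouts suit templates or regex; variable layouts suit AI extraction
  steps.push(structured ? 'template-or-regex' : 'ai-extraction');

  // Tables benefit from a dedicated extraction step
  if (hasTables) steps.push('table-extraction');

  return steps;
}
```

A scanned contract with pricing tables, for instance, would map to OCR followed by AI extraction and a table extraction pass.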

Document Extraction with Papalily

Papalily extends beyond web scraping to handle document data extraction. Our AI-powered platform can extract structured data from PDFs, images, and scanned documents with high accuracy. Whether you're processing invoices, forms, or reports, Papalily delivers clean, structured JSON ready for your applications.


Full documentation at papalily.com/docs