
Data Cleaning and Processing After Web Scraping

📅 June 9, 2026 ⏱ 11 min read By Papalily Team

Web scraping is only half the battle. Once you've extracted data from websites, you're often left with a messy, unstructured collection of raw information. Inconsistent formats, missing values, duplicate entries, and noisy text are common challenges that can render your scraped data unusable without proper cleaning. This guide walks you through the essential steps to transform raw scraped data into clean, analysis-ready datasets.

Why Data Cleaning Matters

Raw scraped data is rarely perfect. Websites use different formatting conventions, dynamic content loads inconsistently, and HTML structures can vary across pages. Without cleaning, these issues compound into significant problems in downstream analysis.

A commonly cited estimate is that data scientists spend 60-80% of their time on data preparation. Investing in robust cleaning pipelines upfront pays off in faster analysis and more accurate results.

The Data Cleaning Pipeline

A systematic approach to data cleaning follows the ETL (Extract, Transform, Load) pattern. After scraping (Extract), your cleaning pipeline (Transform) should address these key areas:

1. Initial Inspection and Profiling

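Before making changes, understand what you're working with. Profile your dataset to identify each column's inferred type, its missing-value count and percentage, its number of unique values, and a few sample values, along with the number of exact duplicate rows: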

data-profiling.js
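// Infer a column's type from its non-null values ('mixed' if they disagree).
// This helper is not defined in the original snippet; this is one simple way to write it.
const inferType = (values) => {
  const types = new Set(values.map(v => (Array.isArray(v) ? 'array' : typeof v)));
  if (types.size === 0) return 'empty';
  return types.size === 1 ? [...types][0] : 'mixed';
};
const profileData = (data) => {
  const profile = { totalRows: data.length, columns: {}, duplicates: 0 };
  if (data.length === 0) return profile;
  // Analyze each column (column names are taken from the first row)
  const columns = Object.keys(data[0]);
  columns.forEach(col => {
    const values = data.map(row => row[col]);
    const nonNull = values.filter(v => v != null && v !== '');
    const missingCount = values.length - nonNull.length;
    profile.columns[col] = {
      type: inferType(nonNull),
      missingCount,
      missingPct: (missingCount / values.length * 100).toFixed(1),
      uniqueCount: new Set(nonNull).size,
      sampleValues: nonNull.slice(0, 5)
    };
  });
  // Count exact duplicates: rows whose serialized form has already been seen
  const seen = new Set();
  data.forEach(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) profile.duplicates++;
    seen.add(key);
  });
  return profile;
};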
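The duplicate check above keys each row with JSON.stringify, which depends on property order, so two rows holding the same fields in a different order would not be counted as duplicates. The sketch below shows one possible workaround: serializing with sorted keys. The names stableStringify and countDuplicates are illustrative, and the helper assumes rows contain only plain objects, arrays, and primitives.

```javascript
// Serialize a value with object keys sorted, so field order does not matter.
// Handles nested plain objects and arrays recursively.
const stableStringify = (value) => {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map(key => `${JSON.stringify(key)}:${stableStringify(value[key])}`)
      .join(',');
    return `{${body}}`;
  }
  // JSON.stringify returns undefined for undefined, so map that case to 'null'
  return JSON.stringify(value) ?? 'null';
};

// Count rows whose order-independent serialization was already seen
const countDuplicates = (rows) => {
  const seen = new Set();
  let duplicates = 0;
  for (const row of rows) {
    const key = stableStringify(row);
    if (seen.has(key)) duplicates++;
    seen.add(key);
  }
  return duplicates;
};

const rows = [
  { id: 1, title: 'Widget' },
  { title: 'Widget', id: 1 }, // same record, different key order
  { id: 2, title: 'Gadget' }
];
console.log(countDuplicates(rows)); // 1
```

Sorting keys costs a little extra time per row, which is usually negligible at profiling scale.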

2. Handling Missing Values

Missing data is inevitable in web scraping. Pages can load partially, some fields are optional, and dynamic content may fail to render. The right strategy depends on how much data is missing and in which fields:

Deletion Strategies

For rows or columns with excessive missing values, deletion may be appropriate:

handle-missing.js
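// A value counts as missing if it is null, undefined, or a blank string.
// (A plain falsy check would wrongly treat 0 and false as missing.)
const isMissing = (val) => val == null || (typeof val === 'string' && val.trim() === '');
// Remove rows where critical fields are missing
const cleanData = data.filter(row => !isMissing(row.title) && !isMissing(row.url));
// Remove columns where 50% or more of the values are missing
const threshold = 0.5;
const colsToKeep = data.length === 0 ? [] : Object.keys(data[0]).filter(col => {
  const missingPct = data.filter(row => isMissing(row[col])).length / data.length;
  return missingPct < threshold;
});
const reducedData = data.map(row => {
  const newRow = {};
  colsToKeep.forEach(col => { newRow[col] = row[col]; });
  return newRow;
});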

Imputation Techniques

When deletion would lose too much data, impute (fill in) missing values:

imputation.js
// Median of the non-null values; returns null when there are none
const median = (arr) => {
  const sorted = arr.filter(x => x != null).sort((a, b) => a - b);
  if (sorted.length === 0) return null;
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};
const priceMedian = median(data.map(row => row.price));
// Forward fill needs the most recent non-missing value, tracked while iterating in order
let lastKnownTimestamp = null;
const imputedData = data.map(row => {
  if (row.timestamp) lastKnownTimestamp = row.timestamp;
  return {
    ...row,
    // Fill numeric missing values with the median
    price: row.price ?? priceMedian,
    // Fill text fields with a placeholder
    description: row.description || 'No description available',
    // Forward fill for time-series data (stays null until a timestamp has been seen)
    timestamp: row.timestamp || lastKnownTimestamp
  };
});
// Mark imputed values for tracking
const trackedData = data.map(row => ({
  ...row,
  price: row.price ?? priceMedian,
  price_was_imputed: row.price == null
}));
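A single global median can be a poor fill value when records fall into very different groups, such as product categories with different price ranges. As one possible refinement, the sketch below imputes with the median of each row's group and falls back to the overall median when a group has no known values. The function name imputeByGroup and the sample data are illustrative assumptions, not part of the original pipeline.

```javascript
// Median of an array of numbers; null for an empty array
const median = (nums) => {
  if (nums.length === 0) return null;
  const sorted = [...nums].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

// Impute a numeric field with the median of the row's group,
// falling back to the overall median when the group has no known values.
const imputeByGroup = (rows, field, groupField) => {
  const known = rows.filter(r => r[field] != null);
  const overall = median(known.map(r => r[field]));
  const byGroup = new Map();
  for (const r of known) {
    if (!byGroup.has(r[groupField])) byGroup.set(r[groupField], []);
    byGroup.get(r[groupField]).push(r[field]);
  }
  const groupMedians = new Map(
    [...byGroup].map(([group, values]) => [group, median(values)])
  );
  return rows.map(r => {
    const wasImputed = r[field] == null;
    const fill = groupMedians.get(r[groupField]) ?? overall;
    return {
      ...r,
      [field]: wasImputed ? fill : r[field],
      [`${field}_was_imputed`]: wasImputed // keep track of which values were filled in
    };
  });
};

const rows = [
  { category: 'books', price: 10 },
  { category: 'books', price: 14 },
  { category: 'books', price: null },
  { category: 'laptops', price: 900 },
  { category: 'laptops', price: null },
  { category: 'toys', price: null } // no known prices in this group
];
console.log(imputeByGroup(rows, 'price', 'category').map(r => r.price));
// [ 10, 14, 12, 900, 900, 14 ]
```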

Removing Duplicates

Duplicate records are common in web scraping due to pagination overlaps, repeated crawls, or inconsistent URL structures. Effective deduplication requires identifying what makes a record unique:

deduplication.js
// Remove exact duplicates based on all fields.
// JSON.stringify is sensitive to key order, so rows should share a consistent field order.
const uniqueExact = (data) => {
  const seen = new Set();
  return data.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
};
// Remove duplicates based on specific key fields
const uniqueByKey = (data, keyFields) => {
  const seen = new Set();
  return data.filter(row => {
    // Serialize the key values as an array so values containing '|' cannot collide
    const key = JSON.stringify(keyFields.map(f => row[f]));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
};
// Usage: deduplicate by product ID
const uniqueProducts = uniqueByKey(data, ['product_id']);
// Fuzzy deduplication for near-matches (e.g., similar titles)
const similarity = (a, b) => {
  // Simple Jaccard similarity over lowercase word sets
  const tokens = (s) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter(x => setB.has(x)).length;
  const union = setA.size + setB.size - intersection;
  // Two empty strings share no words; avoid dividing by zero
  return union === 0 ? 0 : intersection / union;
};
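The snippet above defines a similarity measure but does not apply it to a dataset. A minimal sketch of one way to use it follows: keep a row only if its title is not too similar to any row already kept. The function name dedupeFuzzy and the 0.8 threshold are assumptions for illustration, and the right threshold depends on your data.

```javascript
// Jaccard similarity over lowercase word sets (0 = no shared words, 1 = identical sets)
const jaccard = (a, b) => {
  const tokens = (s) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokens(a);
  const setB = tokens(b);
  const intersection = [...setA].filter(x => setB.has(x)).length;
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union;
};

// Keep the first row of each group of near-duplicates.
// Every row is compared against the rows kept so far: O(n^2) in the worst case.
const dedupeFuzzy = (rows, field, threshold = 0.8) => {
  const kept = [];
  for (const row of rows) {
    const isNearDuplicate = kept.some(
      k => jaccard(String(k[field] ?? ''), String(row[field] ?? '')) >= threshold
    );
    if (!isNearDuplicate) kept.push(row);
  }
  return kept;
};

const products = [
  { title: 'Acme Wireless Mouse Black' },
  { title: 'ACME  wireless mouse black' }, // same words, different case and spacing
  { title: 'Acme Wireless Keyboard Black' } // shares 3 of 5 distinct words: 0.6
];
console.log(dedupeFuzzy(products, 'title', 0.8).length); // 2
```

Because of the pairwise comparisons, this approach suits datasets of a few thousand rows. Larger datasets typically need a blocking step first, for example comparing only rows that share a brand or category.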

Standardizing and Normalizing Data

Web data comes in inconsistent formats. Prices may include currency symbols, dates use different conventions, and text contains varying whitespace. Standardization creates uniformity:

Text Cleaning

text-cleaning.js
const cleanText = (text) => {
  if (!text) return '';
  return text
    .replace(/[\u200B-\u200D\uFEFF]/g, '') // Remove zero-width characters
    .replace(/&lt;/g, '<')                 // Decode common HTML entities
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&')                // Decode &amp; last to avoid double-decoding
    .replace(/\s+/g, ' ')                  // Collapse whitespace runs, including newlines
    .trim();                               // Remove leading/trailing whitespace
};
// Normalize whitespace and case for comparison.
// Note: this strips every character outside a-z and 0-9, including accented letters.
const normalizeForCompare = (text) => cleanText(text).toLowerCase().replace(/[^a-z0-9]/g, '');
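Scraped text often contains more entities than the few handled above, including numeric references such as &#39; and &#x27;. The sketch below decodes a small table of named entities plus decimal and hexadecimal references in a single pass. The name decodeEntities and the entity table are illustrative and deliberately incomplete; a dedicated HTML parsing library is the safer choice when full coverage matters.

```javascript
// A small table of named entities; extend as needed (illustrative, not complete)
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'],
  ['quot', '"'], ['apos', "'"], ['nbsp', '\u00A0']
]);

// Decode named, decimal (&#39;) and hexadecimal (&#x27;) references in one pass.
// A single pass means "&amp;lt;" becomes "&lt;", not "<".
const decodeEntities = (text) =>
  text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z][a-zA-Z0-9]*);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched rather than throwing
      return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
    }
    // Unknown entity names are left as-is
    return NAMED_ENTITIES.get(body) ?? match;
  });

console.log(decodeEntities('Tom &amp; Jerry &lt;3 &#39;hi&#x27; &unknown;'));
// Tom & Jerry <3 'hi' &unknown;
```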

Numeric Standardization

numeric-cleaning.js
const parsePrice = (priceStr) => {
  if (typeof priceStr === 'number') return priceStr;
  if (!priceStr) return null;
  // Remove currency symbols and whitespace, keep the decimal point.
  // Assumes "1,234.56" style numbers; "1.234,56" style input would be misparsed.
  const cleaned = priceStr
    .replace(/[$€£¥]/g, '')
    .replace(/,/g, '') // Remove thousand separators
    .trim();
  const parsed = parseFloat(cleaned);
  return Number.isNaN(parsed) ? null : parsed;
};
// Extract numeric rating from text like "4.5 out of 5 stars"
const parseRating = (ratingStr) => {
  if (typeof ratingStr === 'number') return ratingStr;
  if (typeof ratingStr !== 'string') return null;
  // Match the first integer or decimal number (a bare "." does not match)
  const match = ratingStr.match(/\d+(?:\.\d+)?/);
  return match ? parseFloat(match[0]) : null;
};
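parsePrice above removes every comma, which suits "1,234.56" but turns the European-style "1.234,56" into 1.234. The following is a heuristic sketch, not a locale-aware parser: when both separators appear, the later one is treated as the decimal separator. The name parseLocalizedNumber is an assumption, and a lone "1.234" remains ambiguous (read here as 1.234), so prefer an explicit locale when the source site's format is known.

```javascript
const parseLocalizedNumber = (input) => {
  // Keep only digits, the two candidate separators, and a minus sign
  const s = String(input).replace(/[^\d.,-]/g, '');
  const lastDot = s.lastIndexOf('.');
  const lastComma = s.lastIndexOf(',');

  let decimalSep = null;
  if (lastDot !== -1 && lastComma !== -1) {
    // Both present: whichever comes last is the decimal separator
    decimalSep = lastDot > lastComma ? '.' : ',';
  } else if (lastDot !== -1) {
    // Only dots: a single dot is a decimal point, repeated dots are grouping
    decimalSep = s.indexOf('.') === lastDot ? '.' : null;
  } else if (lastComma !== -1) {
    // Only commas: a single comma NOT followed by exactly three digits is a decimal comma
    const single = s.indexOf(',') === lastComma;
    const digitsAfter = s.length - lastComma - 1;
    decimalSep = single && digitsAfter !== 3 ? ',' : null;
  }

  let intPart = s;
  let fracPart = '';
  if (decimalSep) {
    const idx = s.lastIndexOf(decimalSep);
    intPart = s.slice(0, idx);
    fracPart = s.slice(idx + 1);
  }
  const digits = intPart.replace(/[.,]/g, ''); // drop grouping separators
  if (digits === '' && fracPart === '') return null;
  const parsed = Number(fracPart ? `${digits}.${fracPart}` : digits);
  return Number.isFinite(parsed) ? parsed : null;
};

console.log(parseLocalizedNumber('$1,234.56'));  // 1234.56
console.log(parseLocalizedNumber('€ 1.234,56')); // 1234.56
console.log(parseLocalizedNumber('12,5'));       // 12.5
console.log(parseLocalizedNumber('n/a'));        // null
```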

Date Normalization

date-normalization.js
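const pad = (n) => String(n).padStart(2, '0');
// Format with local date parts; toISOString() converts to UTC and can shift the day
const toYMD = (d) => `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
// True when year/month/day form a real calendar date (rejects e.g. month 13)
const isRealDate = (y, m, d) => {
  const t = new Date(Date.UTC(y, m - 1, d));
  return t.getUTCFullYear() === y && t.getUTCMonth() === m - 1 && t.getUTCDate() === d;
};
const parseDate = (dateStr) => {
  if (!dateStr) return null;
  const text = String(dateStr).trim();
  // ISO-style input (2026-06-09...): keep the date as written
  const iso = text.match(/^(\d{4})-(\d{2})-(\d{2})/);
  if (iso) return isRealDate(+iso[1], +iso[2], +iso[3]) ? `${iso[1]}-${iso[2]}-${iso[3]}` : null;
  // Slash dates are ambiguous; this assumes US order MM/DD/YYYY (06/09/2026)
  const us = text.match(/^(\d{2})\/(\d{2})\/(\d{4})/);
  if (us) return isRealDate(+us[3], +us[1], +us[2]) ? `${us[3]}-${us[1]}-${us[2]}` : null;
  // Other formats ("9 June 2026", "June 9, 2026") fall back to the Date constructor,
  // whose handling of non-ISO strings is implementation-dependent
  const date = new Date(text);
  return Number.isNaN(date.getTime()) ? null : toYMD(date);
};
// Relative date parsing for "N days ago" phrases (e.g., "2 days ago", "a day ago")
const parseRelativeDate = (text, now = new Date()) => {
  if (!text || !/\bdays? ago\b/i.test(text)) return null;
  const days = parseInt(text.match(/\d+/)?.[0] ?? '1', 10);
  const result = new Date(now);
  result.setDate(result.getDate() - days);
  return toYMD(result);
};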
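parseRelativeDate above covers only "N days ago", while scraped pages also use phrases such as "3 hours ago", "a week ago", or "last month". The sketch below extends the same idea to several units. The name parseRelative and the set of supported phrases are assumptions for illustration; "last week" is approximated as seven days back, and month arithmetic can overflow (one month before March 31 lands in early March).

```javascript
const pad = (n) => String(n).padStart(2, '0');
// Format using local date parts so the result matches the local calendar day
const toYMD = (d) => `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;

// `now` is injectable so the function can be tested with a fixed reference time
const parseRelative = (text, now = new Date()) => {
  const s = String(text).trim().toLowerCase();
  const result = new Date(now);
  if (s === 'today') return toYMD(result);
  if (s === 'yesterday') {
    result.setDate(result.getDate() - 1);
    return toYMD(result);
  }
  // Matches "3 days ago", "a week ago", "an hour ago", "last month"
  const m = s.match(
    /^(?:(\d+|an?)\s+(hour|day|week|month|year)s?\s+ago|last\s+(week|month|year))$/
  );
  if (!m) return null;
  const unit = m[3] ?? m[2];
  const count = m[3] || !/^\d+$/.test(m[1]) ? 1 : parseInt(m[1], 10);
  switch (unit) {
    case 'hour': result.setHours(result.getHours() - count); break;
    case 'day': result.setDate(result.getDate() - count); break;
    case 'week': result.setDate(result.getDate() - 7 * count); break;
    case 'month': result.setMonth(result.getMonth() - count); break;
    case 'year': result.setFullYear(result.getFullYear() - count); break;
  }
  return toYMD(result);
};

const reference = new Date(2026, 5, 9, 12, 0, 0); // June 9, 2026, noon local time
console.log(parseRelative('2 days ago', reference)); // 2026-06-07
console.log(parseRelative('a week ago', reference)); // 2026-06-02
console.log(parseRelative('last month', reference)); // 2026-05-09
console.log(parseRelative('soon', reference));       // null
```

For reproducible results, it helps to pass the scrape timestamp as the reference time instead of the time at which cleaning runs.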

Validating Data Quality

After cleaning, validate that your data meets quality standards. Define rules specific to your use case:

validation.js
const validators = {
  url: (val) => /^https?:\/\/.+/.test(val),
  email: (val) => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(val),
  price: (val) => typeof val === 'number' && val >= 0,
  rating: (val) => typeof val === 'number' && val >= 0 && val <= 5
};
// Missing (null/undefined) fields are skipped here; required fields are checked separately
const validateRow = (row, rules) => {
  const errors = [];
  Object.entries(rules).forEach(([field, validator]) => {
    if (row[field] != null && !validator(row[field])) {
      errors.push({ field, value: row[field], message: `Invalid ${field}` });
    }
  });
  return { isValid: errors.length === 0, errors };
};
// Validate entire dataset
const validateDataset = (data, rules) => {
  const results = data.map(row => ({ ...row, ...validateRow(row, rules) }));
  const valid = results.filter(r => r.isValid);
  const invalid = results.filter(r => !r.isValid);
  return {
    valid,
    invalid,
    stats: { total: data.length, validCount: valid.length, invalidCount: invalid.length }
  };
};
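Once rows are split into valid and invalid, it helps to see which fields fail most often, since a field that fails on most rows usually points to a broken selector or parser rather than bad source data. The sketch below tallies errors per field. It assumes invalid rows carry the errors array produced by validateRow above; the name summarizeErrors and the sample rows are illustrative.

```javascript
// Count validation errors per field, most frequent first
const summarizeErrors = (invalidRows) => {
  const counts = new Map();
  for (const row of invalidRows) {
    for (const { field } of row.errors ?? []) {
      counts.set(field, (counts.get(field) ?? 0) + 1);
    }
  }
  return [...counts]
    .map(([field, count]) => ({ field, count }))
    .sort((a, b) => b.count - a.count);
};

// Sample rows in the shape returned as `invalid` by validateDataset
const invalidRows = [
  { url: 'https://example.com/a', price: -5, isValid: false,
    errors: [{ field: 'price', value: -5, message: 'Invalid price' }] },
  { url: 'ftp://example.com/b', price: -1, isValid: false,
    errors: [
      { field: 'price', value: -1, message: 'Invalid price' },
      { field: 'url', value: 'ftp://example.com/b', message: 'Invalid url' }
    ] },
  { url: 'https://example.com/c', rating: 9, isValid: false,
    errors: [{ field: 'rating', value: 9, message: 'Invalid rating' }] }
];

console.log(summarizeErrors(invalidRows));
// [ { field: 'price', count: 2 }, { field: 'url', count: 1 }, { field: 'rating', count: 1 } ]
```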

Structuring for Storage

Clean data should be structured consistently for database insertion or API responses. Ensure type consistency and handle nested objects:

structuring.js
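const { randomUUID } = require('crypto');
// Assumed helpers (called but not defined in the original snippet); adjust to your data
const generateId = () => randomUUID();
const extractCurrency = (price) =>
  ({ '$': 'USD', '€': 'EUR', '£': 'GBP', '¥': 'JPY' })[String(price).match(/[$€£¥]/)?.[0]] ?? null;
const safeHostname = (url) => {
  try { return new URL(url).hostname; } catch { return null; } // new URL() throws on bad input
};
const structureForStorage = (row) => ({
  id: row.id || generateId(),
  title: cleanText(row.title),
  description: cleanText(row.description),
  url: row.url,
  price: parsePrice(row.price),
  currency: extractCurrency(row.price) || 'USD',
  rating: parseRating(row.rating),
  // Drop commas first so "1,234" parses as 1234 instead of 1
  review_count: parseInt(String(row.reviews ?? '').replace(/,/g, ''), 10) || 0,
  scraped_at: new Date().toISOString(),
  source_domain: safeHostname(row.url),
  // Flatten nested objects for relational databases
  category: row.category?.name || row.category,
  category_id: row.category?.id,
  // Store original raw data for debugging
  _raw: JSON.stringify(row)
});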
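The snippet above flattens one known nested field by hand. When scraped records nest more deeply or unpredictably, a generic helper can do the same job. The sketch below joins nested keys with underscores and stores arrays as JSON text; the name flatten and the sample record are illustrative assumptions. Note that keys can collide (for example a top-level category_id next to category.id), in which case the later value wins.

```javascript
// Flatten nested plain objects into a single level, joining keys with "_".
// Arrays are stored as JSON text; Dates and primitives are kept as-is.
const flatten = (obj, prefix = '', out = {}) => {
  for (const [key, value] of Object.entries(obj)) {
    const name = prefix ? `${prefix}_${key}` : key;
    if (Array.isArray(value)) {
      out[name] = JSON.stringify(value);
    } else if (value !== null && typeof value === 'object' && !(value instanceof Date)) {
      flatten(value, name, out); // recurse into nested objects
    } else {
      out[name] = value;
    }
  }
  return out;
};

const record = {
  id: 7,
  category: { id: 3, name: 'Books' },
  tags: ['new', 'sale'],
  seller: { address: { city: 'Oslo' } }
};
console.log(flatten(record));
// {
//   id: 7,
//   category_id: 3,
//   category_name: 'Books',
//   tags: '["new","sale"]',
//   seller_address_city: 'Oslo'
// }
```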

Building a Reusable Cleaning Pipeline

For ongoing scraping operations, build a reusable pipeline that applies consistent transformations:

pipeline.js
class DataCleaningPipeline {
  constructor(config = {}) {
    // Spread first so the defaults below still apply when a key is passed as undefined
    this.config = {
      ...config,
      requiredFields: config.requiredFields || [],
      dedupKeys: config.dedupKeys || [],
      validations: config.validations || {},
      imputationRules: config.imputationRules || {}
    };
  }
  async process(rawData) {
    let data = [...rawData];
    // Step 1: Profile
    const profile = profileData(data);
    console.log('Data profile:', profile);
    // Step 2: Clean text fields first, so whitespace-only values become empty strings
    data = data.map(row => {
      const cleaned = { ...row };
      Object.keys(row).forEach(key => {
        if (typeof row[key] === 'string') cleaned[key] = cleanText(row[key]);
      });
      return cleaned;
    });
    // Step 3: Remove rows missing required fields (0 and false count as present)
    data = data.filter(row =>
      this.config.requiredFields.every(field => row[field] != null && row[field] !== '')
    );
    // Step 4: Deduplicate
    if (this.config.dedupKeys.length > 0) {
      data = uniqueByKey(data, this.config.dedupKeys);
    }
    // Step 5: Impute missing values; function rules are evaluated once against the dataset
    const fillValues = Object.entries(this.config.imputationRules).map(
      ([field, rule]) => [field, typeof rule === 'function' ? rule(data) : rule]
    );
    data = data.map(row => {
      const imputed = { ...row };
      fillValues.forEach(([field, value]) => {
        if (imputed[field] == null) imputed[field] = value;
      });
      return imputed;
    });
    // Step 6: Validate
    const { valid, invalid, stats } = validateDataset(data, this.config.validations);
    return { valid, invalid, stats, profile };
  }
}

Best Practices for Data Cleaning

To maintain high data quality in your scraping operations:

Automating with Papalily

Papalily handles much of the data cleaning burden automatically. Our AI-powered extraction delivers structured, consistent data that requires minimal post-processing. With built-in normalization for prices, dates, and common formats, you can focus on analysis rather than cleanup.

Get Clean Data with Papalily

Skip the data cleaning headache. Papalily's AI extraction delivers structured, normalized data ready for immediate use in your applications.


Full documentation at papalily.com/docs