Data Cleaning and Processing After Web Scraping

Web scraping is only half the battle. Once you've extracted data from websites, you're often left with a messy, unstructured collection of raw information. Inconsistent formats, missing values, duplicate entries, and noisy text are common challenges that can render your scraped data unusable without proper cleaning. This guide walks you through the essential steps to transform raw scraped data into clean, analysis-ready datasets.

Why Data Cleaning Matters

Raw scraped data is rarely perfect. Websites use different formatting conventions, dynamic content loads inconsistently, and HTML structures can vary across pages. Without cleaning, these issues compound into significant problems:

Inaccurate analysis: Duplicate records skew statistics and aggregations
Failed integrations: Inconsistent formats break database imports and API connections
Missed insights: Missing or malformed data hides valuable patterns
Wasted resources: Dirty data requires repeated processing and manual intervention

A commonly cited estimate is that data scientists spend 60-80% of their time on data preparation. Investing in robust cleaning pipelines upfront pays off in faster analysis and more accurate results.

The Data Cleaning Pipeline

A systematic approach to data cleaning follows the ETL (Extract, Transform, Load) pattern. After scraping (Extract), your cleaning pipeline (Transform) should address these key areas:

1. Initial Inspection and Profiling

Before making changes, understand what you're working with. Profile your dataset to identify:

Column data types and inconsistencies
Missing value patterns and frequencies
Duplicate record counts
Outliers and anomalous values
Text encoding issues

const profileData = (data) => {
  const profile = {
    totalRows: data.length,
    columns: {},
    duplicates: 0
  };

  if (data.length === 0) return profile;

  // Analyze each column
  const columns = Object.keys(data[0]);
  columns.forEach(col => {
    const values = data.map(row => row[col]);
    const nonNull = values.filter(v => v != null && v !== '');
    
    profile.columns[col] = {
      type: inferType(nonNull),
      missingCount: values.length - nonNull.length,
      missingPct: ((values.length - nonNull.length) / values.length * 100).toFixed(1),
      uniqueCount: new Set(nonNull).size,
      sampleValues: nonNull.slice(0, 5)
    };
  });

  // Check for duplicates based on all fields
  const seen = new Set();
  data.forEach(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) profile.duplicates++;
    seen.add(key);
  });

  return profile;
};
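
The profiler above calls inferType, which is not defined anywhere in this guide. The following is one possible sketch of such a helper, under the assumption that a column's type is simply the most common type among its non-empty values; the type labels ('numeric-string', 'date-string', and so on) are illustrative names, not part of any library.

```javascript
// Hypothetical helper: returns the dominant type among a column's non-empty values.
const inferType = (values) => {
  if (values.length === 0) return 'empty';

  const classify = (v) => {
    if (typeof v === 'number') return 'number';
    if (typeof v === 'boolean') return 'boolean';
    if (typeof v !== 'string') return typeof v; // objects, arrays, etc.
    const s = v.trim();
    if (/^-?\d+(\.\d+)?$/.test(s)) return 'numeric-string';
    if (/^\d{4}-\d{2}-\d{2}/.test(s)) return 'date-string';
    return 'string';
  };

  // Count how many values fall into each type
  const counts = {};
  for (const v of values) {
    const t = classify(v);
    counts[t] = (counts[t] || 0) + 1;
  }

  // Pick the most frequent type; flag the column as mixed if it is not unanimous
  const [topType, topCount] = Object.entries(counts).sort((a, b) => b[1] - a[1])[0];
  return topCount === values.length ? topType : `mixed (mostly ${topType})`;
};

console.log(inferType(['19.99', '5', '120.50']));   // numeric-string
console.log(inferType([4.5, 3, 'N/A']));            // mixed (mostly number)
```

A "mixed" result is usually the most useful signal here: it points at columns where a placeholder such as 'N/A' has leaked into otherwise numeric data.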

2. Handling Missing Values

Missing data is inevitable in web scraping. Pages load partially, fields are optional, and dynamic content fails to render. Your strategy depends on the missing data pattern:

Deletion Strategies

For rows or columns with excessive missing values, deletion may be appropriate:

// Remove rows where critical fields are missing
const cleanData = data.filter(row => 
  row.title && row.url && row.title.trim() !== ''
);

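// Remove columns with >50% missing values
// (assumes the first row contains every column; 0 and false count as present)
const threshold = 0.5;
const isMissing = (v) => v == null || v === '';
const colsToKeep = Object.keys(data[0]).filter(col => {
  const missingRatio = data.filter(row => isMissing(row[col])).length / data.length;
  return missingRatio <= threshold;
});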

const reducedData = data.map(row => {
  const newRow = {};
  colsToKeep.forEach(col => newRow[col] = row[col]);
  return newRow;
});

Imputation Techniques

When deletion would lose too much data, impute (fill in) missing values:

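// Fill numeric missing values with median
const median = (arr) => {
  // Keep only finite numbers; null, undefined, NaN, and strings are skipped
  const sorted = arr.filter(x => Number.isFinite(x)).sort((a, b) => a - b);
  if (sorted.length === 0) return null; // No usable values to compute a median from
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};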

const priceMedian = median(data.map(row => row.price));

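// Forward fill needs state: remember the most recent timestamp seen so far
let lastKnownTimestamp = null;

const imputedData = data.map(row => {
  if (row.timestamp) lastKnownTimestamp = row.timestamp;
  return {
    ...row,
    price: row.price ?? priceMedian,
    // Fill text fields with placeholder
    description: row.description || 'No description available',
    // Forward fill for time-series data (assumes rows are in chronological order)
    timestamp: lastKnownTimestamp
  };
});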

// Mark imputed values for tracking
const trackedData = data.map(row => ({
  ...row,
  price: row.price ?? priceMedian,
  price_was_imputed: row.price == null
}));

Removing Duplicates

Duplicate records are common in web scraping due to pagination overlaps, repeated crawls, or inconsistent URL structures. Effective deduplication requires identifying what makes a record unique:

// Remove exact duplicates based on all fields
// (JSON.stringify is key-order sensitive, so rows must list fields in the same order)
const uniqueExact = (data) => {
  const seen = new Set();
  return data.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
};

// Remove duplicates based on specific key fields
const uniqueByKey = (data, keyFields) => {
  const seen = new Set();
  return data.filter(row => {
    // Serialize as a JSON array so values containing '|' cannot collide
    const key = JSON.stringify(keyFields.map(f => row[f]));
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
};

// Usage: deduplicate by product ID
const uniqueProducts = uniqueByKey(data, ['product_id']);
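
Key-based deduplication only works when the key is stable. When the key is a URL, the same page often appears with tracking parameters, a "www." prefix, or a trailing slash. The sketch below canonicalizes URLs with the standard URL API before they are used as keys; the list of tracking parameters and the decision to drop fragments are assumptions to adjust per site.

```javascript
// Assumption: these tracking parameters are illustrative; adjust the list per site.
const TRACKING_PARAMS = ['utm_source', 'utm_medium', 'utm_campaign', 'ref', 'fbclid'];

const canonicalUrl = (rawUrl) => {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // Not an absolute URL; let the caller decide what to do
  }

  TRACKING_PARAMS.forEach(param => url.searchParams.delete(param));
  url.searchParams.sort();          // Make parameter order irrelevant
  url.hash = '';                    // Assumes fragments do not identify a different page
  url.hostname = url.hostname.replace(/^www\./, '');

  // Drop a trailing slash, except on the root path
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
};

const a = canonicalUrl('https://www.example.com/item/42/?utm_source=x&b=2&a=1#reviews');
const b = canonicalUrl('https://example.com/item/42?a=1&b=2');
console.log(a === b); // true
```

Storing the canonical form in its own field (for example, a hypothetical canonical_url) lets you pass that field to uniqueByKey while keeping the original URL intact.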

// Fuzzy deduplication for near-matches (e.g., similar titles)
const similarity = (a, b) => {
  // Simple Jaccard similarity on word sets: |A ∩ B| / |A ∪ B|
  const tokenize = (s) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean));
  const setA = tokenize(a);
  const setB = tokenize(b);
  const intersection = new Set([...setA].filter(x => setB.has(x)));
  const unionSize = setA.size + setB.size - intersection.size;
  return unionSize === 0 ? 1 : intersection.size / unionSize; // Two empty strings are identical
};
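
A similarity score on its own does not remove anything. One way to turn it into a deduplication step is sketched below, assuming the compared field is always a string and that a threshold of 0.8 is a reasonable starting point (both are assumptions, and the dedupeFuzzy name is hypothetical). The Jaccard function is repeated so the sketch runs on its own.

```javascript
// Jaccard similarity on word sets (repeated here so the sketch is self-contained)
const jaccard = (a, b) => {
  const setA = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const setB = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  const shared = [...setA].filter(x => setB.has(x)).length;
  const union = setA.size + setB.size - shared;
  return union === 0 ? 1 : shared / union;
};

// Keep a row only if its field is not too similar to any row already kept.
// This makes O(n^2) comparisons: fine for thousands of rows, too slow for millions.
const dedupeFuzzy = (rows, field, threshold = 0.8) => {
  const kept = [];
  for (const row of rows) {
    const isDuplicate = kept.some(k => jaccard(k[field], row[field]) >= threshold);
    if (!isDuplicate) kept.push(row);
  }
  return kept;
};

const products = [
  { title: 'Acme Wireless Mouse Black' },
  { title: 'acme wireless mouse  black' },
  { title: 'Acme Wired Keyboard' }
];
console.log(dedupeFuzzy(products, 'title').length); // 2
```

The first occurrence always wins, so sort the input by whatever makes a record preferable (most complete, most recent) before calling it.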

Standardizing and Normalizing Data

Web data comes in inconsistent formats. Prices may include currency symbols, dates use different conventions, and text contains varying whitespace. Standardization creates uniformity:

Text Cleaning

const cleanText = (text) => {
  if (!text) return '';
  
  return text
    .replace(/[\u200B-\u200D\uFEFF]/g, '')     // Remove zero-width characters
    .replace(/&lt;/g, '<')                     // Decode HTML entities
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&')                    // Decode &amp; last to avoid double-decoding
    .replace(/\s+/g, ' ')                      // Collapse whitespace runs, including newlines
    .trim();                                   // Remove leading/trailing whitespace
};

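// Reduce text to lowercase letters and digits so formatting differences are ignored
// (\p{L} and \p{N} keep non-ASCII letters and digits, e.g. accented characters)
const normalizeForCompare = (text) => 
  cleanText(text).toLowerCase().replace(/[^\p{L}\p{N}]/gu, '');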

Numeric Standardization

const parsePrice = (priceStr) => {
  if (typeof priceStr === 'number') return priceStr;
  if (!priceStr) return null;
  
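  // Keep digits, separators, and a minus sign; drops currency symbols, codes, and whitespace.
  // Assumes US-style formatting ("1,234.56"), where commas are thousand separators.
  const cleaned = String(priceStr)
    .replace(/[^\d.,-]/g, '')
    .replace(/,/g, '');    // Remove thousand separators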
  
  const parsed = parseFloat(cleaned);
  return isNaN(parsed) ? null : parsed;
};

// Extract numeric rating from text like "4.5 out of 5 stars"
const parseRating = (ratingStr) => {
  // Match the first integer or decimal number; a lone "." no longer matches
  const match = ratingStr?.match(/(\d+(?:\.\d+)?)/);
  return match ? parseFloat(match[1]) : null;
};
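
Price strings scraped from European sites often swap the roles of the comma and the period ("1.299,99 €"). The following is a heuristic sketch, not a full locale parser: it assumes that a separator followed by exactly one or two trailing digits is the decimal mark and that every other separator is a grouping mark. The parseLocalizedPrice name is hypothetical.

```javascript
// Heuristic sketch: a ',' or '.' followed by one or two digits at the end of the
// string is treated as the decimal separator; all other separators are groupings.
const parseLocalizedPrice = (priceStr) => {
  if (typeof priceStr === 'number') return priceStr;
  if (typeof priceStr !== 'string') return null;

  // Keep only digits, separators, and a minus sign
  const digits = priceStr.replace(/[^\d.,-]/g, '');
  if (!/\d/.test(digits)) return null;

  const decimalMatch = digits.match(/[.,](\d{1,2})$/);
  const integerPart = (decimalMatch ? digits.slice(0, decimalMatch.index) : digits)
    .replace(/[.,]/g, ''); // Remaining separators are thousands groupings
  const normalized = decimalMatch ? `${integerPart}.${decimalMatch[1]}` : integerPart;

  const parsed = parseFloat(normalized);
  return Number.isNaN(parsed) ? null : parsed;
};

console.log(parseLocalizedPrice('$1,299.99'));      // 1299.99
console.log(parseLocalizedPrice('1.299,99 €'));     // 1299.99
console.log(parseLocalizedPrice('€ 1.299'));        // 1299
console.log(parseLocalizedPrice('Call for price')); // null
```

The heuristic cannot tell a three-decimal price from a grouped thousand ("1.299" is read as 1299), so when the source site's locale is known, a fixed rule per site is more reliable.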

Date Normalization

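const parseDate = (dateStr) => {
  if (!dateStr) return null;
  const str = dateStr.trim();
  const pad = (n) => String(n).padStart(2, '0');
  
  // ISO format: 2026-06-09 (already normalized)
  const iso = str.match(/^(\d{4})-(\d{2})-(\d{2})/);
  if (iso) return `${iso[1]}-${iso[2]}-${iso[3]}`;
  
  // Slash format: 06/09/2026 (assumes MM/DD/YYYY; swap the groups for DD/MM/YYYY sources)
  const us = str.match(/^(\d{2})\/(\d{2})\/(\d{4})/);
  if (us) return `${us[3]}-${us[1]}-${us[2]}`;
  
  // Other formats ("9 June 2026", "June 9, 2026"): fall back to the built-in
  // parser, whose handling of non-ISO strings is implementation-dependent
  const date = new Date(str);
  if (isNaN(date.getTime())) return null;
  
  // Use local date parts; toISOString() converts to UTC and can shift the day
  return `${date.getFullYear()}-${pad(date.getMonth() + 1)}-${pad(date.getDate())}`;
};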

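// Relative date parsing (e.g., "2 days ago", "a week ago", "yesterday", "last week")
const parseRelativeDate = (text, now = new Date()) => {
  if (!text) return null;
  const lower = text.toLowerCase();
  let days = null;
  
  const match = lower.match(/(\d+|an?)\s+(day|week)s?\s+ago/);
  if (match) {
    const count = /^\d+$/.test(match[1]) ? parseInt(match[1], 10) : 1;
    days = match[2] === 'week' ? count * 7 : count;
  } else if (lower.includes('yesterday')) {
    days = 1;
  } else if (lower.includes('last week')) {
    days = 7;
  }
  if (days === null) return null;
  
  const date = new Date(now); // Copy so the caller's reference date is not mutated
  date.setDate(date.getDate() - days);
  return date.toISOString().split('T')[0]; // YYYY-MM-DD in UTC
};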

Validating Data Quality

After cleaning, validate that your data meets quality standards. Define rules specific to your use case:

const validators = {
  url: (val) => /^https?:\/\/.+/.test(val),
  email: (val) => /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(val),
  price: (val) => typeof val === 'number' && val >= 0,
  rating: (val) => typeof val === 'number' && val >= 0 && val <= 5
};

const validateRow = (row, rules) => {
  const errors = [];
  
  Object.entries(rules).forEach(([field, validator]) => {
    if (row[field] != null && !validator(row[field])) {
      errors.push({ field, value: row[field], message: `Invalid ${field}` });
    }
  });
  
  return { isValid: errors.length === 0, errors };
};

// Validate entire dataset
const validateDataset = (data, rules) => {
  const results = data.map(row => ({
    ...row,
    ...validateRow(row, rules)
  }));
  
  // Partition once and derive the counts from the partitions
  const valid = results.filter(r => r.isValid);
  const invalid = results.filter(r => !r.isValid);
  
  return {
    valid,
    invalid,
    stats: {
      total: data.length,
      validCount: valid.length,
      invalidCount: invalid.length
    }
  };
};
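
A list of invalid rows is hard to act on by itself; a per-field summary usually shows which cleaner needs attention. The sketch below assumes each invalid row carries an errors array of { field, value, message } objects, which is the shape validateRow produces. The summarizeErrors name and the limit of three example values are illustrative choices.

```javascript
// Assumes each invalid row has an `errors` array of { field, value, message }.
const summarizeErrors = (invalidRows) => {
  const byField = {};
  for (const row of invalidRows) {
    for (const { field, value } of row.errors) {
      if (!byField[field]) byField[field] = { count: 0, examples: [] };
      byField[field].count++;
      // Keep a few distinct offending values to make the failure easy to diagnose
      const { examples } = byField[field];
      if (examples.length < 3 && !examples.includes(value)) examples.push(value);
    }
  }
  return byField;
};

// Hand-written sample in the same shape as validateDataset(...).invalid
const invalidSample = [
  { title: 'A', price: -5, errors: [{ field: 'price', value: -5, message: 'Invalid price' }] },
  { title: 'B', price: -5, rating: 9, errors: [
    { field: 'price', value: -5, message: 'Invalid price' },
    { field: 'rating', value: 9, message: 'Invalid rating' }
  ] }
];

console.log(summarizeErrors(invalidSample));
// { price: { count: 2, examples: [ -5 ] }, rating: { count: 1, examples: [ 9 ] } }
```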

Structuring for Storage

Clean data should be structured consistently for database insertion or API responses. Ensure type consistency and handle nested objects:

// Assumes generateId() and extractCurrency() are helpers defined elsewhere in your project
const structureForStorage = (row) => ({
  id: row.id || generateId(),
  title: cleanText(row.title),
  description: cleanText(row.description),
  url: row.url,
  price: parsePrice(row.price),
  currency: extractCurrency(row.price) || 'USD',
  rating: parseRating(row.rating),
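  review_count: parseInt(String(row.reviews ?? '').replace(/,/g, ''), 10) || 0,  // Handles "1,234"
  scraped_at: new Date().toISOString(),
  source_domain: new URL(row.url).hostname,  // Throws on a malformed URL; validate row.url first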
  
  // Flatten nested objects for relational databases
  category: row.category?.name || row.category,
  category_id: row.category?.id,
  
  // Store original raw data for debugging
  _raw: JSON.stringify(row)
});
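
structureForStorage relies on generateId and extractCurrency, which this guide does not define, and its new URL(...) call throws on malformed input. The following are hypothetical implementations, offered as a sketch only: the ID scheme is not cryptographically secure, the symbol-to-code table is deliberately small, and "$" and "¥" are ambiguous in practice (this sketch assumes USD and JPY).

```javascript
// Hypothetical: a non-cryptographic, unique-enough ID (timestamp plus random suffix).
const generateId = () =>
  `${Date.now().toString(36)}-${Math.random().toString(36).slice(2, 10)}`;

// Hypothetical: map a currency symbol or ISO code found in a price string to a code.
// '$' is ambiguous (USD, CAD, AUD, ...); this sketch assumes USD.
const CURRENCY_SYMBOLS = { '$': 'USD', '€': 'EUR', '£': 'GBP', '¥': 'JPY' };

const extractCurrency = (priceStr) => {
  if (typeof priceStr !== 'string') return null;
  const code = priceStr.match(/\b(USD|EUR|GBP|JPY)\b/i);
  if (code) return code[1].toUpperCase();
  const symbol = priceStr.match(/[$€£¥]/);
  return symbol ? CURRENCY_SYMBOLS[symbol[0]] : null;
};

// new URL() throws on malformed input, so wrap it for use inside a mapper
const safeHostname = (url) => {
  try {
    return new URL(url).hostname;
  } catch {
    return null;
  }
};

console.log(extractCurrency('€1,299.00'));                 // EUR
console.log(extractCurrency('12.50 usd'));                 // USD
console.log(extractCurrency(19.99));                       // null
console.log(safeHostname('https://shop.example.com/p/1')); // shop.example.com
console.log(safeHostname('/relative/path'));               // null
```

Swapping new URL(row.url).hostname for safeHostname(row.url) keeps a single malformed URL from aborting an entire batch.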

Building a Reusable Cleaning Pipeline

For ongoing scraping operations, build a reusable pipeline that applies consistent transformations:

class DataCleaningPipeline {
  constructor(config = {}) {
    // Spread config first so the defaults below also cover keys set to undefined
    this.config = {
      ...config,
      requiredFields: config.requiredFields || [],
      dedupKeys: config.dedupKeys || [],
      validations: config.validations || {},
      imputationRules: config.imputationRules || {}
    };
  }

  async process(rawData) {
    let data = [...rawData];
    
    // Step 1: Profile
    const profile = profileData(data);
    console.log('Data profile:', profile);
    
    // Step 2: Remove rows missing required fields (0 and false count as present)
    data = data.filter(row => 
      this.config.requiredFields.every(field => row[field] != null && row[field] !== '')
    );
    
    // Step 3: Clean text fields
    data = data.map(row => {
      const cleaned = { ...row };
      Object.keys(row).forEach(key => {
        if (typeof row[key] === 'string') {
          cleaned[key] = cleanText(row[key]);
        }
      });
      return cleaned;
    });
    
    // Step 4: Deduplicate
    if (this.config.dedupKeys.length > 0) {
      data = uniqueByKey(data, this.config.dedupKeys);
    }
    
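    // Step 5: Impute missing values
    // Resolve function rules once up front instead of recomputing them for every row
    const fillValues = Object.fromEntries(
      Object.entries(this.config.imputationRules).map(([field, rule]) =>
        [field, typeof rule === 'function' ? rule(data) : rule])
    );
    data = data.map(row => {
      const imputed = { ...row };
      Object.entries(fillValues).forEach(([field, value]) => {
        if (imputed[field] == null) imputed[field] = value;
      });
      return imputed;
    });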
    
    // Step 6: Validate
    const { valid, invalid, stats } = validateDataset(data, this.config.validations);
    
    return { valid, invalid, stats, profile };
  }
}

Best Practices for Data Cleaning

To maintain high data quality in your scraping operations:

Clean early and often: Don't let dirty data accumulate. Clean at extraction time
Preserve originals: Keep raw data alongside cleaned versions for debugging
Document transformations: Track what changes were made and why
Monitor quality metrics: Track completeness, validity, and uniqueness over time
Handle edge cases: Test your cleaners with unusual inputs (emojis, special characters, nulls)
Use type coercion carefully: Be explicit about type conversions to avoid data loss
Validate after cleaning: Always run quality checks on the final output
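
The completeness, validity, and uniqueness metrics mentioned in the list above can be computed per batch and logged over time. The sketch below uses hypothetical definitions (share of non-empty values per field, share of present values passing a validator, and distinct keys divided by total rows); the qualityMetrics name and its option names are assumptions, not an established API.

```javascript
// Hypothetical metric definitions; adapt them to your own notion of quality.
const qualityMetrics = (rows, { keyFields, validators = {} }) => {
  const total = rows.length;
  if (total === 0) return { total, completeness: {}, validity: {}, uniqueness: null };

  const isPresent = (v) => v != null && v !== '';
  const fields = [...new Set(rows.flatMap(row => Object.keys(row)))];

  // Completeness: share of rows with a non-empty value, per field
  const completeness = Object.fromEntries(fields.map(f =>
    [f, rows.filter(row => isPresent(row[f])).length / total]));

  // Validity: share of present values that pass the field's validator
  const validity = Object.fromEntries(Object.entries(validators).map(([f, check]) => {
    const present = rows.filter(row => isPresent(row[f]));
    const passed = present.filter(row => check(row[f])).length;
    return [f, present.length ? passed / present.length : null];
  }));

  // Uniqueness: distinct key combinations divided by total rows
  const keys = new Set(rows.map(row => JSON.stringify(keyFields.map(f => row[f]))));
  const uniqueness = keys.size / total;

  return { total, completeness, validity, uniqueness };
};

const sample = [
  { id: 1, price: 10, title: 'A' },
  { id: 2, price: -3, title: '' },
  { id: 2, price: null, title: 'C' },
  { id: 4, price: 7, title: 'D' }
];
const metrics = qualityMetrics(sample, {
  keyFields: ['id'],
  validators: { price: (v) => typeof v === 'number' && v >= 0 }
});
console.log(metrics.completeness); // { id: 1, price: 0.75, title: 0.75 }
console.log(metrics.uniqueness);   // 0.75
```

A sudden drop in any of these numbers between crawls is often the first sign that a site changed its markup.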

Automating with Papalily

Papalily handles much of the data cleaning burden automatically. Our AI-powered extraction delivers structured, consistent data that requires minimal post-processing. With built-in normalization for prices, dates, and common formats, you can focus on analysis rather than cleanup.

Full documentation at papalily.com/docs