Web Scraping for Sentiment Analysis and Brand Monitoring: 2026 Guide

In today's hyper-connected digital landscape, a brand's reputation can shift in minutes. A single viral tweet, a scathing product review, or an unexpected news mention can dramatically impact customer perception and business outcomes. Sentiment analysis powered by web scraping has emerged as a critical capability for organizations seeking to monitor, understand, and respond to public opinion at scale. By systematically extracting and analyzing customer reviews, social media conversations, news articles, and forum discussions, businesses can gain real-time intelligence about how their brand, products, and competitors are perceived across the digital ecosystem.

The Business Case for Sentiment Intelligence

Modern brand monitoring extends far beyond tracking mentions. Organizations leveraging web scraping for sentiment analysis gain strategic advantages across multiple dimensions:

Crisis detection and response: Identify negative sentiment spikes in real-time, enabling rapid response before issues escalate
Product development insights: Analyze customer feedback patterns to prioritize feature improvements and identify pain points
Competitive positioning: Benchmark sentiment against competitors and identify market opportunities
Campaign effectiveness: Measure sentiment shifts before, during, and after marketing initiatives
Influencer identification: Discover brand advocates and detractors with significant reach
Market trend anticipation: Detect emerging sentiment patterns that signal shifting consumer preferences

The global social media analytics market, which includes sentiment analysis capabilities, is projected to reach $15.6 billion by 2028, a sign that organizations increasingly treat public sentiment as essential competitive intelligence rather than an optional extra.

Data Sources for Comprehensive Sentiment Analysis

Effective sentiment intelligence requires aggregating data from diverse sources, each offering unique perspectives on brand perception:

Primary Sentiment Data Sources

Review Platforms: Trustpilot, G2, Capterra, Amazon, Yelp, App Store, Google Play
Social Networks: Twitter/X, Reddit, LinkedIn, Facebook, Instagram, TikTok
News & Media: Google News, industry publications, press releases, blogs
Forums & Communities: Quora, Stack Overflow, niche community forums, Discord
Video Platforms: YouTube comments, video descriptions, transcript analysis
Internal Sources: Support tickets, chat logs, survey responses, NPS data
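
Because these sources differ so much in shape, it can help to map every scraped item onto one common record before analysis. The sketch below is only an illustration: the `Mention` class and its field names are hypothetical, chosen to mirror the keys (`platform`, `content`, `scraped_at`) that the scrapers in this guide emit.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Mention:
    """Hypothetical common record for any scraped item (review, post, article)."""
    platform: str                   # e.g. 'trustpilot', 'reddit', 'google_news'
    source_type: str                # e.g. 'review', 'social', 'news', 'forum'
    brand: str
    content: str
    rating: Optional[float] = None  # only review platforms carry a rating
    url: str = ''
    scraped_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


review = Mention('trustpilot', 'review', 'Acme', 'Great support team', rating=5.0)
post = Mention('reddit', 'social', 'Acme', 'Anyone else seeing login errors?')

# Both records expose the same keys, so downstream code can treat them alike
print(sorted(asdict(review).keys()) == sorted(asdict(post).keys()))
```

Keeping `rating` optional lets review data and unrated social or news mentions share one storage table without inventing a score for sources that have none.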

Scraping Customer Reviews at Scale

Review platforms contain structured, high-value sentiment data that directly reflects customer experiences. Here's how to build a comprehensive review scraping system:

1. Multi-Platform Review Aggregation

import asyncio
from datetime import datetime, timedelta
from papalily import scrape  # AI-powered scraping API

class ReviewScraper:
    def __init__(self, api_key):
        self.api_key = api_key
        self.platforms = {
            'trustpilot': {
                'base_url': 'https://www.trustpilot.com',
                'review_selector': '.review-card',
                'pagination': '?page={page}'
            },
            'g2': {
                'base_url': 'https://www.g2.com',
                'review_selector': '.review',
                'pagination': '?page={page}'
            },
            'capterra': {
                'base_url': 'https://www.capterra.com',
                'review_selector': '.review-card',
                'pagination': '?page={page}'
            },
            'amazon': {
                'base_url': 'https://www.amazon.com',
                'review_selector': '[data-hook="review"]',
                'pagination': '&pageNumber={page}'
            }
        }
    
    def scrape_reviews(self, platform, company_name, product_name=None, 
                       max_pages=10, date_range=None):
        """Scrape reviews from a specific platform"""
        config = self.platforms.get(platform)
        if not config:
            raise ValueError(f"Unsupported platform: {platform}")
        
        all_reviews = []
        
        for page in range(1, max_pages + 1):
            # Construct search URL based on platform
            if platform == 'trustpilot':
                url = f"{config['base_url']}/review/{company_name.lower().replace(' ', '-')}{config['pagination'].format(page=page)}"
            elif platform == 'g2':
                url = f"{config['base_url']}/products/{company_name.lower().replace(' ', '-')}/reviews{config['pagination'].format(page=page)}"
            elif platform == 'amazon':
                asin = product_name  # ASIN for Amazon products
                url = f"{config['base_url']}/product-reviews/{asin}?sortBy=recent{config['pagination'].format(page=page)}"
            else:
                url = f"{config['base_url']}/reviews/{company_name}{config['pagination'].format(page=page)}"
            
            try:
                data = scrape(
                    url=url,
                    api_key=self.api_key,
                    extract_schema={
                        'reviews': {
                            'selector': config['review_selector'],
                            'type': 'list',
                            'fields': {
                                'reviewer_name': '.reviewer-name, .author-name, [data-hook="review-author"]',
                                'rating': '.star-rating, .rating, [data-hook="review-star-rating"]',
                                'title': '.review-title, .title, [data-hook="review-title"]',
                                'content': '.review-content, .text, [data-hook="review-body"]',
                                'date': '.review-date, .date, [data-hook="review-date"]',
                                'verified': '.verified-badge, .verified-purchase',
                                'helpful_votes': '.helpful-count, .votes',
                                'reviewer_location': '.reviewer-location, .location'
                            }
                        },
                        'total_reviews': '.review-count, .total-reviews',
                        'average_rating': '.average-rating, .overall-rating'
                    },
                    wait_for=config['review_selector']
                )
                
                reviews = data.get('reviews', [])
                if not reviews:
                    break
                
                # Process and filter reviews
                for review in reviews:
                    processed = self._process_review(review, platform, company_name)
                    
                    # Apply date filter if specified
                    if date_range and processed.get('date'):
                        review_date = self._parse_date(processed['date'])
                        if review_date and (review_date < date_range['start'] 
                                          or review_date > date_range['end']):
                            continue
                    
                    all_reviews.append(processed)
                
                # Check if we've reached the end
                if len(reviews) < 10:  # Most platforms show 10+ reviews per page
                    break
                    
            except Exception as e:
                print(f"Error scraping {platform} page {page}: {e}")
                break
        
        return {
            'platform': platform,
            'company': company_name,
            'reviews': all_reviews,
            'total_scraped': len(all_reviews),
            'scraped_at': datetime.utcnow().isoformat()
        }
    
    def _process_review(self, review, platform, company):
        """Normalize and enrich review data"""
        # Extract numeric rating
        rating_text = review.get('rating', '')
        rating = self._extract_rating(rating_text)
        
        # Parse date
        date_text = review.get('date', '')
        parsed_date = self._parse_date(date_text)
        
        # Determine sentiment label
        sentiment = self._categorize_sentiment(rating)
        
        return {
            'platform': platform,
            'company': company,
            'reviewer_name': review.get('reviewer_name', 'Anonymous'),
            'rating': rating,
            'rating_text': rating_text,
            'sentiment': sentiment,
            'title': review.get('title', ''),
            'content': review.get('content', ''),
            'date': parsed_date.isoformat() if parsed_date else date_text,
            'verified_purchase': bool(review.get('verified')),
            'helpful_votes': self._extract_number(review.get('helpful_votes', '0')),
            'reviewer_location': review.get('reviewer_location', ''),
            'word_count': len(review.get('content', '').split())
        }
    
    def _extract_rating(self, rating_text):
        """Extract numeric rating from text"""
        import re
        # Match patterns like "4.5 out of 5", "5 stars", "★★★★☆".
        # "out of" is tried first so "4 out of 5 stars" yields 4.0 rather than 5.0.
        patterns = [
            r'(\d+\.?\d*)\s*out of',
            r'(\d+\.?\d*)\s*stars?',
            r'★+',  # Count star symbols
            r'(\d+\.?\d*)'  # Generic number
        ]
        
        for pattern in patterns:
            match = re.search(pattern, str(rating_text), re.IGNORECASE)
            if match:
                if '★' in match.group():
                    return len(match.group())
                return float(match.group(1))
        return None
    
    def _categorize_sentiment(self, rating):
        """Categorize sentiment based on rating"""
        if rating is None:
            return 'neutral'
        if rating >= 4:
            return 'positive'
        elif rating <= 2:
            return 'negative'
        else:
            return 'neutral'
    
    def _parse_date(self, date_text):
        """Parse various date formats"""
        if not date_text:
            return None
        
        formats = [
            '%B %d, %Y',
            '%Y-%m-%d',
            '%d %B %Y',
            '%m/%d/%Y',
            '%d/%m/%Y',
            '%Y-%m-%dT%H:%M:%S',
            '%b %d, %Y'
        ]
        
        for fmt in formats:
            try:
                return datetime.strptime(date_text.strip(), fmt)
            except ValueError:
                continue
        return None
    
    def _extract_number(self, text):
        """Extract numeric value from text"""
        import re
        match = re.search(r'\d+', str(text).replace(',', ''))
        return int(match.group()) if match else 0
    
    def aggregate_reviews(self, company_name, platforms=None, max_pages=5):
        """Scrape reviews from multiple platforms"""
        platforms = platforms or ['trustpilot', 'g2', 'capterra']
        all_results = []
        
        for platform in platforms:
            try:
                result = self.scrape_reviews(platform, company_name, max_pages=max_pages)
                all_results.append(result)
            except Exception as e:
                print(f"Failed to scrape {platform}: {e}")
                all_results.append({
                    'platform': platform,
                    'company': company_name,
                    'error': str(e),
                    'reviews': []
                })
        
        return self._compile_sentiment_summary(all_results)
    
    def _compile_sentiment_summary(self, results):
        """Compile cross-platform sentiment analysis"""
        all_reviews = []
        platform_stats = {}
        
        for result in results:
            platform = result['platform']
            reviews = result.get('reviews', [])
            all_reviews.extend(reviews)
            
            if reviews:
                ratings = [r['rating'] for r in reviews if r['rating']]
                sentiments = [r['sentiment'] for r in reviews]
                
                platform_stats[platform] = {
                    'total_reviews': len(reviews),
                    'average_rating': sum(ratings) / len(ratings) if ratings else 0,
                    'positive_pct': sentiments.count('positive') / len(sentiments) * 100,
                    'negative_pct': sentiments.count('negative') / len(sentiments) * 100,
                    'neutral_pct': sentiments.count('neutral') / len(sentiments) * 100
                }
        
        # Overall statistics
        all_ratings = [r['rating'] for r in all_reviews if r['rating']]
        all_sentiments = [r['sentiment'] for r in all_reviews]
        
        return {
            'company': results[0]['company'] if results else None,
            'total_reviews': len(all_reviews),
            'platforms_covered': list(platform_stats.keys()),
            'overall_rating': round(sum(all_ratings) / len(all_ratings), 2) if all_ratings else 0,
            'sentiment_distribution': {
                'positive': all_sentiments.count('positive'),
                'negative': all_sentiments.count('negative'),
                'neutral': all_sentiments.count('neutral')
            },
            'platform_breakdown': platform_stats,
            'recent_reviews': sorted(
                all_reviews,
                key=lambda x: x.get('date', ''),
                reverse=True
            )[:20],
            'scraped_at': datetime.utcnow().isoformat()
        }
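
To make the summary arithmetic concrete, here is a minimal standalone sketch of the per-platform statistics computed in `_compile_sentiment_summary`, run on a few hand-written reviews. The `summarize` function and the sample data are illustrative assumptions, not part of the scraper class.

```python
def summarize(reviews):
    """Compute rating and sentiment statistics for normalized review dicts."""
    ratings = [r['rating'] for r in reviews if r['rating'] is not None]
    labels = [r['sentiment'] for r in reviews]
    total = len(labels)

    def pct(label):
        # Share of reviews carrying this label, as a percentage
        return round(labels.count(label) / total * 100, 1) if total else 0.0

    return {
        'total_reviews': total,
        'average_rating': round(sum(ratings) / len(ratings), 2) if ratings else 0,
        'positive_pct': pct('positive'),
        'negative_pct': pct('negative'),
        'neutral_pct': pct('neutral'),
    }


sample = [
    {'rating': 5.0, 'sentiment': 'positive'},
    {'rating': 4.0, 'sentiment': 'positive'},
    {'rating': 1.0, 'sentiment': 'negative'},
    {'rating': None, 'sentiment': 'neutral'},  # rating could not be parsed
]

stats = summarize(sample)
print(stats)
```

Note that a review whose rating could not be parsed still counts toward `neutral_pct` but is left out of `average_rating`, so a platform with many unparsed ratings can look more neutral than it really is.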

2. Review Content Analysis and Keyword Extraction

Beyond ratings, the actual content of reviews contains rich insights about specific aspects of products and services:

from collections import Counter
import re

class ReviewAnalyzer:
    def __init__(self):
        self.aspect_keywords = {
            'customer_service': ['support', 'service', 'help', 'staff', 'team', 'response'],
            'pricing': ['price', 'cost', 'expensive', 'cheap', 'value', 'money', 'worth'],
            'quality': ['quality', 'build', 'durable', 'reliable', 'broken', 'defect'],
            'usability': ['easy', 'simple', 'intuitive', 'difficult', 'complicated', 'user-friendly'],
            'features': ['feature', 'functionality', 'option', 'capability', 'missing', 'lacking'],
            'performance': ['fast', 'slow', 'speed', 'performance', 'lag', 'responsive'],
            'design': ['design', 'look', 'appearance', 'beautiful', 'ugly', 'interface']
        }
    
    def analyze_review_topics(self, reviews):
        """Extract topics and aspects mentioned in reviews"""
        aspect_mentions = {aspect: [] for aspect in self.aspect_keywords}
        
        for review in reviews:
            content = review.get('content', '').lower()
            sentiment = review.get('sentiment', 'neutral')
            
            for aspect, keywords in self.aspect_keywords.items():
                for keyword in keywords:
                    if keyword in content:
                        # Extract context around keyword
                        context = self._extract_context(content, keyword)
                        aspect_mentions[aspect].append({
                            'review_id': review.get('id'),
                            'sentiment': sentiment,
                            'keyword': keyword,
                            'context': context,
                            'rating': review.get('rating')
                        })
                        break
        
        return aspect_mentions
    
    def _extract_context(self, text, keyword, window=50):
        """Extract text around keyword"""
        # The format string has three placeholders, so window is passed twice
        pattern = re.compile(
            r'.{0,%d}\b%s\b.{0,%d}' % (window, re.escape(keyword), window),
            re.IGNORECASE)
        matches = pattern.findall(text)
        return matches[0] if matches else ''
    
    def extract_key_phrases(self, reviews, sentiment_filter=None):
        """Extract frequently mentioned phrases"""
        if sentiment_filter:
            reviews = [r for r in reviews if r.get('sentiment') == sentiment_filter]
        
        all_text = ' '.join([r.get('content', '') for r in reviews]).lower()
        
        # Extract bigrams and trigrams
        words = re.findall(r'\b\w+\b', all_text)
        
        bigrams = [' '.join(words[i:i+2]) for i in range(len(words)-1)]
        trigrams = [' '.join(words[i:i+3]) for i in range(len(words)-2)]
        
        # Filter out common stop words
        stop_words = {'the', 'and', 'for', 'are', 'but', 'not', 'you', 'all', 
                      'can', 'had', 'her', 'was', 'one', 'our', 'out', 'day', 
                      'get', 'has', 'him', 'his', 'how', 'its', 'may', 'new', 
                      'now', 'old', 'see', 'two', 'who', 'boy', 'did', 'she', 
                      'use', 'her', 'way', 'many', 'oil', 'sit', 'set', 'run', 
                      'eat', 'far', 'sea', 'eye', 'ago', 'off', 'too', 'any', 
                      'try', 'ask', 'end', 'why', 'let', 'put', 'say', 'she', 
                      'try', 'way', 'own', 'say', 'too', 'old', 'tell', 'very', 
                      'when', 'much', 'would', 'there', 'their', 'what', 'said', 
                      'each', 'which', 'will', 'about', 'could', 'other', 'after', 
                      'first', 'never', 'these', 'think', 'where', 'being', 'every', 
                      'great', 'might', 'shall', 'still', 'those', 'while', 'this', 
                      'that', 'with', 'have', 'from', 'they', 'know', 'want', 'been', 
                      'good', 'much', 'some', 'time', 'very', 'when', 'come', 'here', 
                      'just', 'like', 'long', 'make', 'many', 'over', 'such', 'take', 
                      'than', 'them', 'well', 'were'}
        
        filtered_bigrams = [b for b in bigrams 
                           if not any(w in stop_words for w in b.split())]
        filtered_trigrams = [t for t in trigrams 
                            if not any(w in stop_words for w in t.split())]
        
        return {
            'top_bigrams': Counter(filtered_bigrams).most_common(20),
            'top_trigrams': Counter(filtered_trigrams).most_common(20)
        }
    
    def identify_emerging_issues(self, reviews, days_back=7):
        """Identify recently emerging negative themes"""
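        from datetime import datetime, timedelta, timezone
        
        cutoff_date = datetime.now(timezone.utc) - timedelta(days=days_back)
        
        def review_date(review):
            """Parse the ISO date; return None when it is missing or unparseable"""
            try:
                parsed = datetime.fromisoformat(str(review.get('date')).replace('Z', '+00:00'))
            except ValueError:
                return None
            # Naive dates are assumed to be UTC so they compare with cutoff_date
            return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)
        
        negative = [(r, review_date(r)) for r in reviews
                    if r.get('sentiment') == 'negative']
        
        recent_negative = [r for r, d in negative if d and d > cutoff_date]
        
        # Compare with historical baseline
        older_negative = [r for r, d in negative if d and d <= cutoff_date]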
        
        recent_phrases = self.extract_key_phrases(recent_negative)
        older_phrases = self.extract_key_phrases(older_negative)
        
        # Find phrases increasing in frequency
        recent_counts = dict(recent_phrases['top_bigrams'])
        older_counts = dict(older_phrases['top_bigrams'])
        
        emerging = []
        for phrase, count in recent_counts.items():
            baseline = older_counts.get(phrase, 0)
            if baseline == 0 or count / max(baseline, 1) > 2:  # 2x increase
                emerging.append({
                    'phrase': phrase,
                    'recent_count': count,
                    'baseline_count': baseline,
                    'increase_factor': count / max(baseline, 1)
                })
        
        return sorted(emerging, key=lambda x: x['increase_factor'], reverse=True)
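
One detail worth noting: `extract_key_phrases` joins every review into a single string before building n-grams, so the last word of one review can pair with the first word of the next. A possible variation, sketched here with a hypothetical `bigrams_per_review` helper and a deliberately tiny stop-word list, counts bigrams one review at a time:

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {'the', 'a', 'is', 'was', 'and', 'to', 'it', 'my', 'i'}


def bigrams_per_review(reviews, top_n=5):
    """Count bigrams within each review so phrases never span two reviews."""
    counts = Counter()
    for review in reviews:
        words = re.findall(r'\b\w+\b', review.get('content', '').lower())
        for first, second in zip(words, words[1:]):
            if first in STOP_WORDS or second in STOP_WORDS:
                continue
            counts[f'{first} {second}'] += 1
    return counts.most_common(top_n)


reviews = [
    {'content': 'Customer support was slow'},
    {'content': 'Slow customer support and billing errors'},
    {'content': 'Billing errors again'},
]

top = bigrams_per_review(reviews)
print(top)
```

With the joined-text approach, the first two sample reviews would produce the spurious bigram "slow slow"; counting per review avoids that at no real cost.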

Social Media Sentiment Monitoring

Social platforms capture unfiltered, real-time opinions that traditional review sites miss. Here's how to monitor brand sentiment across social channels:

class SocialMediaMonitor:
    def __init__(self, api_key):
        self.api_key = api_key
    
    def monitor_reddit_mentions(self, brand_names, subreddits=None, days_back=7):
        """Monitor Reddit discussions about brands"""
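        from datetime import datetime, timedelta
        from urllib.parse import quote
        
        # NOTE: cutoff_date is computed but not yet applied; filtering by
        # days_back would require parsing each post's scraped post_date.
        cutoff_date = datetime.now() - timedelta(days=days_back)
        all_mentions = []
        
        # Search across relevant subreddits
        target_subreddits = subreddits or [
            'business', 'marketing', 'startups', 'technology',
            'webdev', 'programming', 'SaaS', 'Entrepreneur'
        ]
        
        for brand in brand_names:
            for subreddit in target_subreddits:
                # URL-encode the brand so spaces, '&' and '#' survive in the query string
                url = f"https://www.reddit.com/r/{subreddit}/search/?q={quote(brand)}&sort=new"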
                
                try:
                    data = scrape(
                        url=url,
                        api_key=self.api_key,
                        extract_schema={
                            'posts': {
                                'selector': '.Post',
                                'type': 'list',
                                'fields': {
                                    'title': 'h3',
                                    'content': '[data-click-id="text"]',
                                    'author': '[data-testid="post_author_link"]',
                                    'upvotes': '[data-testid="post-container"] [data-testid="upvote-button"] + div',
                                    'comment_count': '[data-testid="post-container"] [data-testid="comment-button"] + span',
                                    'post_date': '[data-testid="post_timestamp"]',
                                    'post_url': {'selector': 'a[data-click-id="body"]', 'attribute': 'href'}
                                }
                            }
                        },
                        wait_for='.Post'
                    )
                    
                    for post in data.get('posts', []):
                        mention = {
                            'platform': 'reddit',
                            'brand': brand,
                            'subreddit': subreddit,
                            'title': post.get('title', ''),
                            'content': post.get('content', ''),
                            'author': post.get('author', ''),
                            'upvotes': self._extract_number(post.get('upvotes', '0')),
                            'comment_count': self._extract_number(post.get('comment_count', '0')),
                            'engagement_score': self._calculate_engagement(post),
                            'post_url': f"https://reddit.com{post.get('post_url', '')}",
                            'scraped_at': datetime.utcnow().isoformat()
                        }
                        all_mentions.append(mention)
                        
                except Exception as e:
                    print(f"Error scraping Reddit r/{subreddit} for {brand}: {e}")
        
        return all_mentions
    
    def monitor_twitter_mentions(self, brand_handles, keywords=None):
        """Monitor Twitter/X mentions (requires Nitter or similar)"""
        # Note: Direct Twitter scraping is restricted
        # Use Nitter instances or Twitter API v2 for this
        
        mentions = []
        
        for handle in brand_handles:
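            # Using Nitter as an alternative frontend. This URL is the handle's own
            # timeline; to capture what others say about the brand, query a search
            # page for the handle or the keywords instead.
            url = f"https://nitter.net/{handle}"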
            
            try:
                data = scrape(
                    url=url,
                    api_key=self.api_key,
                    extract_schema={
                        'tweets': {
                            'selector': '.timeline-item',
                            'type': 'list',
                            'fields': {
                                'content': '.tweet-content',
                                'author': '.username',
                                'date': '.tweet-date a',
                                'replies': '.tweet-stat .icon-reply + div',
                                'retweets': '.tweet-stat .icon-retweet + div',
                                'likes': '.tweet-stat .icon-heart + div'
                            }
                        }
                    }
                )
                
                for tweet in data.get('tweets', []):
                    mentions.append({
                        'platform': 'twitter',
                        'brand_handle': handle,
                        'content': tweet.get('content', ''),
                        'author': tweet.get('author', ''),
                        'date': tweet.get('date', ''),
                        'engagement': {
                            'replies': self._extract_number(tweet.get('replies', '0')),
                            'retweets': self._extract_number(tweet.get('retweets', '0')),
                            'likes': self._extract_number(tweet.get('likes', '0'))
                        },
                        'scraped_at': datetime.utcnow().isoformat()
                    })
                    
            except Exception as e:
                print(f"Error monitoring Twitter for {handle}: {e}")
        
        return mentions
    
    def monitor_quora_discussions(self, brand_names):
        """Monitor Quora questions and answers about brands"""
        discussions = []
        
        for brand in brand_names:
            url = f"https://www.quora.com/search?q={brand.replace(' ', '+')}"
            
            try:
                data = scrape(
                    url=url,
                    api_key=self.api_key,
                    extract_schema={
                        'questions': {
                            'selector': '.q-box',
                            'type': 'list',
                            'fields': {
                                'question': '.question_text',
                                'answer_preview': '.answer_text',
                                'upvotes': '.upvote_count',
                                'views': '.view_count',
                                'author': '.user_name'
                            }
                        }
                    }
                )
                
                for q in data.get('questions', []):
                    discussions.append({
                        'platform': 'quora',
                        'brand': brand,
                        'question': q.get('question', ''),
                        'answer_preview': q.get('answer_preview', ''),
                        'author': q.get('author', ''),
                        'upvotes': self._extract_number(q.get('upvotes', '0')),
                        'views': self._extract_number(q.get('views', '0')),
                        'scraped_at': datetime.utcnow().isoformat()
                    })
                    
            except Exception as e:
                print(f"Error monitoring Quora for {brand}: {e}")
        
        return discussions
    
    def _extract_number(self, text):
        """Extract numeric value from text"""
        import re
        if not text:
            return 0
        match = re.search(r'[\d,]+', str(text))
        return int(match.group().replace(',', '')) if match else 0
    
    def _calculate_engagement(self, post):
        """Calculate engagement score for a post"""
        upvotes = self._extract_number(post.get('upvotes', '0'))
        comments = self._extract_number(post.get('comment_count', '0'))
        return upvotes + (comments * 2)  # Comments weighted more heavily
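
Engagement counts on social sites are often displayed in abbreviated form, such as "1.2k", which `_extract_number` above would read as 1. The following is a hedged sketch of a more tolerant parser; `parse_count` and its suffix table are hypothetical names introduced for illustration, and the sample strings are made up.

```python
import re

# Multipliers for abbreviated counts (assumed lowercase after normalization)
_SUFFIXES = {'k': 1_000, 'm': 1_000_000}


def parse_count(text):
    """Parse display counts such as '1.2k', '3,400' or '15 comments' into an int."""
    if not text:
        return 0
    # The trailing \b keeps '15 min ago' from being read as 15 million
    match = re.search(r'(\d[\d,]*(?:\.\d+)?)\s*([km])?\b', str(text).lower())
    if not match:
        return 0
    number = float(match.group(1).replace(',', ''))
    multiplier = _SUFFIXES.get(match.group(2), 1)
    return int(round(number * multiplier))


def engagement_score(upvotes_text, comments_text):
    """Same weighting as _calculate_engagement: comments count double."""
    return parse_count(upvotes_text) + parse_count(comments_text) * 2


print(parse_count('1.2k'), parse_count('3,400'), parse_count('15 comments'))
print(engagement_score('1.2k', '85'))
```

Falling back to 0 for unparseable text such as "Vote" matches the behavior of the original helper, so the two can be swapped without changing downstream code.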

News and Media Sentiment Tracking

News coverage significantly impacts brand perception. Monitoring media sentiment helps identify PR opportunities and potential crises:

class NewsSentimentMonitor:
    def __init__(self, api_key):
        self.api_key = api_key
        self.news_sources = [
            'https://news.google.com/search?q={query}',
            'https://www.bing.com/news/search?q={query}',
        ]
    
    def monitor_news_mentions(self, brand_names, days_back=7):
        """Monitor news coverage for brand mentions"""
        from datetime import datetime, timedelta
        from urllib.parse import quote
        
        all_articles = []
        cutoff_date = datetime.now() - timedelta(days=days_back)
        
        for brand in brand_names:
            query = f'"{brand}"'  # Quoted for exact-phrase matching
            
            # Google News (URL-encode the query, including the quotes)
            url = f"https://news.google.com/search?q={quote(query)}"
            
            try:
                data = scrape(
                    url=url,
                    api_key=self.api_key,
                    extract_schema={
                        'articles': {
                            'selector': 'article',
                            'type': 'list',
                            'fields': {
                                'headline': 'h3 a',
                                'source': '.vr1PYe',
                                'publish_time': 'time',
                                'snippet': '.Y3v8qd',
                                'link': {'selector': 'h3 a', 'attribute': 'href'}
                            }
                        }
                    },
                    wait_for='article'
                )
                
                for article in data.get('articles', []):
                    # Convert relative Google News URLs
                    link = article.get('link', '')
                    if link.startswith('./'):
                        link = f"https://news.google.com{link[1:]}"
                    
                    all_articles.append({
                        'platform': 'google_news',
                        'brand': brand,
                        'headline': article.get('headline', ''),
                        'source': article.get('source', ''),
                        'publish_time': article.get('publish_time', ''),
                        'snippet': article.get('snippet', ''),
                        'link': link,
                        'scraped_at': datetime.utcnow().isoformat()
                    })
                    
            except Exception as e:
                print(f"Error monitoring news for {brand}: {e}")
        
        return all_articles
    
    def analyze_headline_sentiment(self, headlines):
        """Simple rule-based headline sentiment analysis"""
        positive_words = ['launch', 'growth', 'success', 'innovation', 'partnership', 
                         'award', 'milestone', 'expansion', 'breakthrough', 'record']
        negative_words = ['lawsuit', 'breach', 'scandal', 'crisis', 'layoff', 
                         'decline', 'failure', 'controversy', 'investigation', 'fine']
        
        results = []
        for headline in headlines:
            headline_lower = headline.lower()
            pos_count = sum(1 for w in positive_words if w in headline_lower)
            neg_count = sum(1 for w in negative_words if w in headline_lower)
            
            if neg_count > pos_count:
                sentiment = 'negative'
            elif pos_count > neg_count:
                sentiment = 'positive'
            else:
                sentiment = 'neutral'
            
            results.append({
                'headline': headline,
                'sentiment': sentiment,
                'positive_indicators': pos_count,
                'negative_indicators': neg_count
            })
        
        return results
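
The substring test in `analyze_headline_sentiment` is simple, but it also fires inside unrelated words: "fine" matches "refines", for example. One possible refinement, sketched below with hypothetical names and shortened word lists, anchors each keyword at a word boundary while still allowing a few common inflections:

```python
import re

POSITIVE_WORDS = ['launch', 'growth', 'award', 'record', 'expansion']
NEGATIVE_WORDS = ['lawsuit', 'breach', 'layoff', 'fine', 'decline']


def _compile(words):
    # \b anchors the match at word boundaries; the optional suffix group
    # tolerates simple inflections such as 'launches' or 'fined'
    alternatives = '|'.join(re.escape(w) for w in words)
    return re.compile(r'\b(?:%s)(?:s|es|d|ed|ing)?\b' % alternatives, re.IGNORECASE)


_POSITIVE_RE = _compile(POSITIVE_WORDS)
_NEGATIVE_RE = _compile(NEGATIVE_WORDS)


def headline_sentiment(headline):
    """Label a headline by counting whole-word keyword matches."""
    pos_count = len(_POSITIVE_RE.findall(headline))
    neg_count = len(_NEGATIVE_RE.findall(headline))
    if neg_count > pos_count:
        return 'negative'
    if pos_count > neg_count:
        return 'positive'
    return 'neutral'


for text in ['Acme refines its roadmap',
             'Regulator fined Acme over data breach',
             'Acme launches record expansion']:
    print(text, '->', headline_sentiment(text))
```

The suffix list is a rough stand-in for stemming and will miss forms such as "declining"; a proper stemmer or a trained classifier would be the next step if keyword rules prove too coarse.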

Building a Real-Time Sentiment Dashboard

Aggregating data from multiple sources into a unified monitoring system enables proactive brand management:

# sentiment_dashboard.py - Real-time brand sentiment monitoring
import os  # needed for os.getenv() in daily_sentiment_snapshot
from datetime import datetime, timedelta
import asyncio
from celery import Celery
import pandas as pd

app = Celery('sentiment_monitor', broker='redis://localhost:6379')

class SentimentDashboard:
    def __init__(self, api_key):
        self.api_key = api_key
        self.review_scraper = ReviewScraper(api_key)
        self.social_monitor = SocialMediaMonitor(api_key)
        self.news_monitor = NewsSentimentMonitor(api_key)
        self.analyzer = ReviewAnalyzer()
    
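    # Registered as a Celery task. It takes no `self`: the task builds its own
    # SentimentDashboard from the PAPALILY_API_KEY environment variable.
    @app.task
    def daily_sentiment_snapshot(brand_names):
        """Generate daily sentiment snapshot for tracked brands"""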
        dashboard = SentimentDashboard(os.getenv('PAPALILY_API_KEY'))
        
        snapshot = {
            'generated_at': datetime.utcnow().isoformat(),
            'brands': {}
        }
        
        for brand in brand_names:
            brand_data = {
                'reviews': dashboard.review_scraper.aggregate_reviews(brand, max_pages=3),
                'social_mentions': dashboard.social_monitor.monitor_reddit_mentions([brand]),
                'news_mentions': dashboard.news_monitor.monitor_news_mentions([brand])
            }
            
            # Calculate composite sentiment score
            brand_data['composite_score'] = dashboard._calculate_composite_score(brand_data)
            
            # Identify trends
            brand_data['trends'] = dashboard._identify_trends(brand, brand_data)
            
            snapshot['brands'][brand] = brand_data
        
        # Store snapshot
        store_sentiment_snapshot(snapshot)
        
        # Alert on significant changes
        dashboard._check_alerts(snapshot)
        
        return snapshot
    
    def _calculate_composite_score(self, brand_data):
        """Calculate weighted composite sentiment score"""
        scores = []
        
        # Review sentiment (40% weight)
        reviews = brand_data.get('reviews', {})
        if reviews.get('total_reviews', 0) > 0:
            sentiment_dist = reviews.get('sentiment_distribution', {})
            total = sum(sentiment_dist.values())
            if total > 0:
                review_score = (
                    (sentiment_dist.get('positive', 0) * 1) +
                    (sentiment_dist.get('neutral', 0) * 0.5) +
                    (sentiment_dist.get('negative', 0) * 0)
                ) / total
                scores.append(('reviews', review_score, 0.4))
        
        # Social sentiment (35% weight)
        social = brand_data.get('social_mentions', [])
        if social:
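            # Rough proxy only: engagement measures attention, not polarity, so a
            # viral complaint raises this score too. Swap in a text-based sentiment
            # score per mention when one is available.
            avg_engagement = sum(s.get('engagement_score', 0) for s in social) / len(social)
            # Normalize to the 0-1 range (negative or very large values are clamped)
            social_score = max(0, min(avg_engagement / 100, 1))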
            scores.append(('social', social_score, 0.35))
        
        # News sentiment (25% weight)
        news = brand_data.get('news_mentions', [])
        if news:
            headlines = [n['headline'] for n in news]
            sentiment_analysis = self.news_monitor.analyze_headline_sentiment(headlines)
            pos_count = sum(1 for s in sentiment_analysis if s['sentiment'] == 'positive')
            neg_count = sum(1 for s in sentiment_analysis if s['sentiment'] == 'negative')
            total = len(sentiment_analysis)
            if total > 0:
                news_score = (pos_count - neg_count + total) / (2 * total)
                scores.append(('news', news_score, 0.25))
        
        # Calculate weighted average
        if scores:
            total_weight = sum(s[2] for s in scores)
            weighted_sum = sum(s[1] * s[2] for s in scores)
            return round((weighted_sum / total_weight) * 100, 2)
        
        return 50  # Neutral default
    
    def _identify_trends(self, brand, brand_data):
        """Identify sentiment trends and patterns"""
        trends = {
            'emerging_issues': [],
            'positive_highlights': [],
            'volume_changes': {},
            'competitor_comparison': {}
        }
        
        # Check for emerging issues in reviews
        reviews = brand_data.get('reviews', {}).get('recent_reviews', [])
        if reviews:
            issues = self.analyzer.identify_emerging_issues(reviews)
            trends['emerging_issues'] = issues[:5]  # Top 5
        
        # Analyze positive keywords (skipped when there are no recent reviews)
        if reviews:
            positive_phrases = self.analyzer.extract_key_phrases(reviews, 'positive')
            trends['positive_highlights'] = positive_phrases.get('top_bigrams', [])[:5]
        
        return trends
    
    def _check_alerts(self, snapshot):
        """Check for conditions requiring alerts"""
        for brand, data in snapshot['brands'].items():
            score = data.get('composite_score', 50)
            
            # Alert on a low composite score (absolute threshold, not a change
            # relative to the previous snapshot)
            if score < 30:
                send_alert('sentiment_drop', brand, score)
            
            # Alert on emerging issues
            issues = data.get('trends', {}).get('emerging_issues', [])
            if len(issues) > 3:
                send_alert('multiple_issues', brand, issues)
            
            # Alert on negative news spike
            news = data.get('news_mentions', [])
            negative_news = [
                n for n in news 
                if self.news_monitor.analyze_headline_sentiment([n['headline']])[0]['sentiment'] == 'negative'
            ]
            if len(negative_news) > 2:
                send_alert('negative_news_spike', brand, negative_news)
    
    def generate_competitor_comparison(self, brand_names, metric='composite_score'):
        """Generate competitive sentiment analysis"""
        comparison = {
            'generated_at': datetime.utcnow().isoformat(),
            'metric': metric,
            'rankings': []
        }
        
        for brand in brand_names:
            # Fetch latest snapshot
            snapshot = get_latest_snapshot(brand)
            if snapshot:
                comparison['rankings'].append({
                    'brand': brand,
                    'score': snapshot.get('composite_score', 0),
                    'review_count': snapshot.get('reviews', {}).get('total_reviews', 0),
                    'social_mentions': len(snapshot.get('social_mentions', [])),
                    'news_mentions': len(snapshot.get('news_mentions', []))
                })
        
        # Sort by score
        comparison['rankings'].sort(key=lambda x: x['score'], reverse=True)
        
        return comparison
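
To make the weighting in _calculate_composite_score concrete, the sketch below reproduces the same arithmetic as standalone functions. The function names and the input numbers are made up for illustration. When a source is missing, the remaining weights are renormalized, so a brand with only review data is scored on reviews alone.

```python
def news_score(pos_count, neg_count, total):
    """Map the positive/negative headline balance onto a 0-1 scale (0.5 = neutral)."""
    return (pos_count - neg_count + total) / (2 * total)

def composite_score(source_scores, weights=None):
    """Weighted average of per-source scores (each 0-1), scaled to 0-100."""
    weights = weights or {'reviews': 0.4, 'social': 0.35, 'news': 0.25}
    # Keep only the sources that actually produced a score
    present = {k: v for k, v in source_scores.items() if v is not None}
    if not present:
        return 50  # neutral default, as in the dashboard class
    total_weight = sum(weights[k] for k in present)
    weighted_sum = sum(present[k] * weights[k] for k in present)
    return round(weighted_sum / total_weight * 100, 2)

# 3 positive, 1 negative and 6 neutral headlines out of 10
news = news_score(3, 1, 10)
print(news)
print(composite_score({'reviews': 0.8, 'social': 0.5, 'news': news}))
print(composite_score({'reviews': 0.8, 'social': None, 'news': None}))
```

With these inputs the news score is 0.6 and the three-source composite works out to (0.8 × 0.4 + 0.5 × 0.35 + 0.6 × 0.25) × 100 = 64.5, while the reviews-only case comes out at 80.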

Advanced Sentiment Analysis Techniques

Moving beyond basic polarity detection, advanced techniques extract deeper insights from scraped content:

Aspect-based sentiment analysis: Identify sentiment toward specific product features, not just overall opinion
Emotion detection: Classify text into emotions like joy, anger, fear, sadness, and surprise
Sarcasm detection: Identify when positive words convey negative sentiment
Intent classification: Determine if text indicates purchase intent, churn risk, or advocacy
Influencer scoring: Identify authors with disproportionate impact on sentiment
Temporal analysis: Track how sentiment evolves in response to events
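
As a rough illustration of aspect-based sentiment analysis, the sketch below splits a review into clauses and attaches opinion words to whichever aspect each clause mentions. The lexicons and the function name aspect_sentiment are hypothetical and deliberately tiny; production systems typically rely on trained models rather than keyword lists.

```python
import re
from collections import defaultdict

# Hypothetical lexicons, for illustration only
ASPECTS = {
    'battery': ['battery', 'charge', 'charging'],
    'screen': ['screen', 'display'],
    'support': ['support', 'customer service'],
}
POSITIVE = {'great', 'excellent', 'love', 'fast', 'helpful'}
NEGATIVE = {'poor', 'terrible', 'slow', 'broken', 'unhelpful'}

def aspect_sentiment(review):
    """Score each aspect by the opinion words in the clauses that mention it."""
    scores = defaultdict(int)
    # Split on clause boundaries so opinions stay attached to nearby aspects
    clauses = re.split(r'[.;!?]|\bbut\b|\band\b', review.lower())
    for clause in clauses:
        tokens = set(re.findall(r'[a-z]+', clause))
        polarity = len(tokens & POSITIVE) - len(tokens & NEGATIVE)
        for aspect, keywords in ASPECTS.items():
            if any(keyword in clause for keyword in keywords):
                scores[aspect] += polarity
    return dict(scores)

review = "The screen is excellent, but the battery is terrible. Support was helpful."
print(aspect_sentiment(review))
# {'screen': 1, 'battery': -1, 'support': 1}
```

A single overall polarity score would rate this review as mildly positive and hide the battery complaint, which is the information a product team most needs.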

Integration Tip: Combine scraped sentiment data with internal metrics like support ticket sentiment, NPS scores, and churn data for a complete customer health picture.

Ethical Considerations and Best Practices

Privacy Alert: Never scrape private messages, password-protected content, or personal information not intended for public consumption. Focus on publicly posted reviews, comments, and articles.

Sentiment monitoring operates at the intersection of data collection and privacy. Responsible implementation requires attention to:

Terms of Service compliance: Review and respect each platform's terms of service
Rate limiting: Implement respectful scraping rates to avoid overloading target servers
Data retention: Establish clear policies for how long sentiment data is stored
Transparency: Consider disclosing monitoring activities where appropriate
Bias awareness: Recognize that scraped data may not represent all customer segments equally
Actionable focus: Use sentiment data to improve products and services, not just to watch the numbers
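
Two of the practices above, rate limiting and respecting a site's stated crawl rules, can be sketched with the standard library alone. The class name PoliteFetchPolicy, the bot name, and the two-second interval are assumptions chosen for illustration. A robots.txt check complements a review of the platform's terms of service; it does not replace one.

```python
import time
from urllib.robotparser import RobotFileParser

class PoliteFetchPolicy:
    """Decides whether and when a URL may be requested."""

    def __init__(self, robots_lines, user_agent, min_interval=2.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)  # lines of an already-downloaded robots.txt
        self.user_agent = user_agent
        self.min_interval = min_interval  # minimum seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def wait_turn(self):
        """Block until min_interval seconds have passed since the last request."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now

robots = ["User-agent: *", "Disallow: /private/"]
policy = PoliteFetchPolicy(robots, "brand-monitor-bot", min_interval=2.0)
print(policy.allowed("https://example.com/reviews"))        # True
print(policy.allowed("https://example.com/private/inbox"))  # False
```

In a scraper loop, call allowed(url) before each request and wait_turn() immediately before sending it. The clock and sleep parameters exist only so the timing logic can be tested without real delays.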

Future of Sentiment Intelligence

The sentiment analysis landscape continues to evolve rapidly:

Multimodal analysis: Analyzing sentiment in images, videos, and audio alongside text
Real-time streaming: Processing sentiment as content is published, not in batches
Predictive sentiment: Forecasting sentiment shifts before they occur
Cross-lingual analysis: Unified sentiment tracking across languages
Contextual understanding: AI models that understand industry-specific terminology and nuance
Privacy-preserving techniques: Analyzing sentiment without storing raw content
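
Streaming and privacy-preserving processing can be combined in a small way: keep a running average that is updated as each score arrives, and store only that aggregate instead of the raw text. The sketch below assumes each piece of content has already been scored in the range -1 to 1; the class name, the smoothing factor, and the drop threshold are illustrative choices.

```python
class StreamingSentiment:
    """Exponentially weighted moving average of sentiment scores in [-1, 1]."""

    def __init__(self, alpha=0.2, drop_threshold=0.3):
        self.alpha = alpha                    # weight given to the newest score
        self.drop_threshold = drop_threshold  # gap below the average that counts as a drop
        self.average = None

    def update(self, score):
        """Fold in one score; return True if it sits well below the running average."""
        if self.average is None:
            self.average = score
            return False
        is_drop = (self.average - score) > self.drop_threshold
        self.average = self.alpha * score + (1 - self.alpha) * self.average
        return is_drop

tracker = StreamingSentiment()
for score in [0.6, 0.5, 0.7, -0.4]:
    flagged = tracker.update(score)
    print(score, flagged, round(tracker.average, 3))
```

Only the last score in this sequence is flagged, because it falls far below the average built up by the first three. A single outlier should not trigger an alert on its own; in practice you would require several flagged scores within a time window.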

Build Your Brand Intelligence System with Papalily

Ready to put sentiment analysis to work? Papalily's AI-powered scraping API makes it easy to collect reviews, social mentions, and news coverage from across the web, giving you the data foundation for brand intelligence.

Conclusion

Web scraping for sentiment analysis and brand monitoring has evolved from a nice-to-have capability to a strategic necessity. In an era where public opinion forms in minutes and spreads globally in seconds, organizations that systematically collect, analyze, and act on sentiment data gain decisive competitive advantages.

The technologies and techniques outlined in this guide provide a foundation for building sophisticated brand intelligence systems. From aggregating customer reviews across platforms to monitoring social conversations and tracking news coverage, comprehensive sentiment monitoring enables proactive reputation management, data-driven product development, and competitive positioning.

Success requires more than technical implementation: it demands ethical consideration, strategic focus, and a commitment to turning insights into action. Organizations that master sentiment intelligence will be best positioned to build lasting customer relationships, navigate crises effectively, and maintain competitive advantage in an increasingly transparent and connected world.