This comprehensive project analyzes sentiment and topics from social media text data using advanced Natural Language Processing (NLP), machine learning, and deep learning techniques. By processing Twitter data through multiple analytical layers, we extract valuable insights about public opinion, trending topics, and emotional responses to various subjects. The resulting dashboard visualizes these insights through interactive charts and word clouds, making complex data patterns easily understandable.
Social-Media-Sentiment-Analysis/
โโโ preprocess.py # Text cleaning, stopword removal, language detection
โโโ analysis.py # Sentiment scoring, KMeans clustering, entity extraction, topic modeling
โโโ app.py # Streamlit dashboard for visualization
โโโ requirements.txt # Project dependencies
โโโ README.md # Project overview and usage instructions
- Text Cleaning: Removes URLs, emojis, mentions, hashtags, and special characters
- Normalization: Converts text to lowercase and removes extra whitespace
- Stopword Removal: Filters out common English stopwords using NLTK
- Language Detection: Uses the
langdetectlibrary to filter for English tweets only - Data Filtering: Handles missing values and removes duplicate entries
# Example of the text cleaning function
def clean_text(text):
text = str(text)
text = re.sub(r"http\S+|www.\S+", "", text) # Remove URLs
text = emoji.replace_emoji(text, replace='') # Remove emojis
text = re.sub(r"@\w+|#\w+", "", text) # Remove mentions & hashtags
text = re.sub(r"[^A-Za-z\s]", "", text) # Keep only letters and spaces
text = text.lower()
text = ' '.join(word for word in text.split() if word not in stop_words)
return text-
Sentiment Analysis:
- Uses
TextBlobto calculate polarity (positive/negative) and subjectivity scores - Polarity ranges from -1 (negative) to 1 (positive)
- Subjectivity ranges from 0 (objective) to 1 (subjective)
- Uses
-
Sentiment Clustering:
- Applies K-means clustering to group tweets with similar sentiment patterns
- Standardizes features to ensure equal weighting of polarity and subjectivity
- Maps clusters to sentiment scores (1-5) based on average polarity
-
Named Entity Recognition:
- Uses
spaCy's pre-trained models to identify organizations, people, products, locations, and brands - Calculates frequency of entities to identify key players in discussions
- Filters entities by type to focus on relevant categories
- Uses
-
Topic Modeling:
- Generates sentence embeddings using
sentence-transformers(all-mpnet-base-v2) - Applies K-means clustering to group semantically similar content
- Identifies dominant words in each cluster to label topics
- Creates vector representations that capture deeper semantic meaning beyond simple word frequency
- Generates sentence embeddings using
# Example of sentiment extraction
def get_sentiment(text):
analysis = TextBlob(text)
return analysis.sentiment.polarity, analysis.sentiment.subjectivity- Framework: Built with
Streamlitfor rapid web app deployment - Visualization Libraries: Utilizes
Plotlyfor interactive charts andWordCloudfor visual text analysis - Key Visualizations:
- Sentiment distribution bar chart showing emotion patterns across dataset
- Interactive scatter plot of polarity vs. subjectivity with hover data
- Dynamic word clouds for each sentiment cluster highlighting key terms
- Bar chart of top named entities showing key discussion subjects
- Topic cluster overview with representative keywords
git clone https://github.com/yourusername/Social-Media-Sentiment-Analysis.git
cd Social-Media-Sentiment-Analysispip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords')"
python -m spacy download en_core_web_smThe input dataset should be a CSV file with a text column (default column name is clean_text). You can modify the column name in the code if needed.
# Step 1: Clean and prepare the data
python preprocess.py
# Step 2: Perform sentiment analysis and clustering
python analysis.py
# Step 3: Launch the dashboard
streamlit run app.pyVisualizes the distribution of sentiment scores (1-5) across the dataset, providing an immediate overview of public opinion (positive, negative, or neutral).
An interactive scatter plot that maps tweets based on their sentiment polarity (x-axis) and subjectivity (y-axis). Points are colored by cluster, and hovering reveals the actual tweet text.
Generates visual representations of the most frequent words in each sentiment cluster, with word size proportional to frequency. This provides an intuitive understanding of the vocabulary associated with different emotional responses.
Displays the most frequently mentioned organizations, people, products, locations, and brands in the dataset, highlighting key subjects of discussion.
Shows the dominant words for each topic cluster, helping to identify the main themes and subjects being discussed across the dataset.
The project uses the all-mpnet-base-v2 model from the sentence-transformers library to generate high-quality vector representations of tweets. These 768-dimensional embeddings capture semantic relationships between texts far better than traditional bag-of-words approaches.
We use K-means clustering with 5 clusters (configurable) for both sentiment and topic clustering. The number of clusters can be adjusted based on dataset size and desired granularity.
# Example of the clustering implementation
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(features)- The embedding and entity recognition processes are computationally intensive for large datasets
- For datasets >10,000 tweets, consider sampling or batch processing
- The Streamlit dashboard is optimized for datasets up to 100,000 tweets
# In analysis.py, modify the n_clusters parameter
kmeans = KMeans(n_clusters=YOUR_DESIRED_NUMBER, random_state=42, n_init=10)# In analysis.py, replace the model name with an alternative from sentence-transformers
model = SentenceTransformer('YOUR_PREFERRED_MODEL')The Streamlit framework makes it easy to extend the dashboard with additional visualizations. See the Streamlit documentation for more options.
- Data Quality: Better preprocessing leads to more accurate sentiment analysis
- Language Considerations: The model performs best on English text; use language filtering
- Entity Recognition: May need tuning for domain-specific terms
- Dashboard Performance: Limit the number of tweets displayed in interactive elements for better performance
- Sentiment Interpretation: Consider industry benchmarks when interpreting sentiment scores
- Temporal Analysis: Adding time-based tracking of sentiment changes
- Comparative Analysis: Facilitating comparison between different topics or time periods
- Advanced NLP: Integration with more sophisticated models like BERT or GPT
- Real-time Processing: Adding capability for stream processing of incoming tweets
- Geospatial Analysis: Mapping sentiment patterns by geographic location
For suggestions, issues, or contributions, please:
- Open an issue on GitHub
- Submit a pull request with proposed changes
- Contact the maintainer at:
nukanarendra2006@gmail.com
This project was developed as part of an effort to better understand public opinion through social media analysis using modern NLP techniques.