Skip to content

Sujyeet/bigdata-processing-project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Big Data Processing Project

Project Overview

This project demonstrates scalable data processing pipelines using Apache Spark and PySpark. The implementation covers various big data scenarios including social media analytics (Twitter), recommendation systems (MovieLens), transportation analytics (Taxi data), real-time streaming, and cloud storage integration with Amazon S3.

Features

  • Twitter Data Analytics: Sentiment analysis and trend detection on Twitter datasets
  • MovieLens Recommendation System: Collaborative filtering and recommendation algorithms
  • Taxi Data Processing: Geospatial analysis and trip pattern recognition
  • Real-time Streaming: Live data processing using Spark Streaming
  • Cloud Integration: S3 data ingestion and processing workflows
  • Scalable Architecture: Distributed processing capabilities with PySpark

Folder Structure

bigdata-processing-project/
├── report/                 # Project documentation and analysis reports
├── sample_outputs/         # Example outputs and visualizations
├── src/                    # Source code (Python/PySpark scripts)
├── data/                   # Local datasets (if any)
├── config/                 # Configuration files
├── notebooks/              # Jupyter notebooks for analysis
└── README.md              # This file

Datasets Used

  • Twitter Dataset: Social media posts for sentiment analysis
  • MovieLens Dataset: Movie ratings and user preferences
  • NYC Taxi Dataset: Trip records and fare information
  • Streaming Data: Real-time event streams
  • S3 Data Sources: Cloud-stored datasets for distributed processing

Note: Actual datasets are not included in this repository. Instructions for obtaining and configuring datasets are provided in the individual script documentation.

Running the Code

Prerequisites

  • Apache Spark 3.x
  • Python 3.7+
  • PySpark
  • Required Python packages (see requirements.txt)

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/Sujyeet/bigdata-processing-project.git
    cd bigdata-processing-project
  2. Install dependencies:

    pip install -r requirements.txt
  3. Configure Spark:

    • Ensure Spark is properly installed and configured
    • Set SPARK_HOME environment variable
    • Configure cluster settings if running in distributed mode
  4. Data Setup:

    • Download required datasets (links provided in documentation)
    • Configure data paths in config files
    • Set up S3 credentials if using cloud storage
  5. Execute Pipelines:

    # Twitter Analytics
    spark-submit src/twitter_analysis.py
    
    # MovieLens Processing
    spark-submit src/movielens_recommender.py
    
    # Taxi Data Analysis
    spark-submit src/taxi_analytics.py
    
    # Streaming Pipeline
    spark-submit src/streaming_processor.py

Configuration Notes

  • Modify configuration files in the config/ directory to match your environment
  • Adjust memory and core settings based on your cluster capacity
  • Update S3 bucket configurations for cloud processing
  • Set appropriate logging levels for debugging

Outputs & Results

The project generates various outputs stored in the sample_outputs/ directory:

  • Analytics Reports: Statistical summaries and insights
  • Visualizations: Charts and graphs showing data patterns
  • Processed Data: Cleaned and transformed datasets
  • Model Results: Recommendation scores and predictions
  • Performance Metrics: Processing times and resource utilization

Sample outputs include:

  • Twitter sentiment distribution charts
  • MovieLens recommendation accuracy metrics
  • Taxi trip heatmaps and patterns
  • Streaming data processing statistics

Report

Detailed technical documentation and analysis reports are available in the report/ directory, including:

  • Technical Architecture: System design and implementation details
  • Performance Analysis: Benchmarking and optimization results
  • Data Insights: Key findings and analytical outcomes
  • Methodology: Algorithms and approaches used
  • Lessons Learned: Best practices and recommendations

Author/Contact

Sujyeet
GitHub: @Sujyeet

For questions, suggestions, or collaboration opportunities, please:

  • Open an issue in this repository
  • Contact via GitHub profile
  • Submit pull requests for improvements

This project showcases modern big data processing techniques using industry-standard tools and frameworks. Feel free to explore, learn, and contribute!

About

Scalable PySpark pipelines – Twitter, MovieLens, Taxi, Streaming, S3 (no data)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors