ParvathyM155/Salary_Prediction_Using_Machine_Learning


💼 Salary Prediction: Adult Dataset

Binary classification project predicting whether an individual's annual income exceeds $50K based on demographic and employment attributes from the UCI Adult Census dataset.


📌 Overview

This project applies supervised machine learning to predict whether a person earns more than $50K per year, using the well-known UCI Adult Census Income dataset. The pipeline covers data cleaning, exploratory data analysis (EDA), feature engineering, model training, hyperparameter tuning, and evaluation, all packaged in a reproducible workflow suitable for portfolio and internship presentation.


📊 Dataset

Key Features

  • age, workclass, education, education-num
  • marital-status, occupation, relationship
  • race, sex, capital-gain, capital-loss
  • hours-per-week, native-country
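
As a rough illustration of how these columns might be loaded, the sketch below reads two made-up rows with pandas. It assumes a header-less, comma-separated file in which "?" marks a missing value (the convention of the raw UCI files) and a target column named `income`; the actual layout of data/adult.csv may differ.

```python
import io

import pandas as pd

# Feature names listed above, plus an assumed target column "income".
COLUMNS = [
    "age", "workclass", "education", "education-num",
    "marital-status", "occupation", "relationship",
    "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income",
]

# Illustrative rows standing in for data/adult.csv (not real project data).
SAMPLE = """\
39, State-gov, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, ?, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, >50K
"""

df = pd.read_csv(
    io.StringIO(SAMPLE),
    names=COLUMNS,
    na_values="?",          # treat "?" as a missing value
    skipinitialspace=True,  # drop the space that follows each comma
)

missing_per_column = df.isna().sum()
y = (df["income"] == ">50K").astype(int)  # 1 if income exceeds $50K
```

Encoding the target as 0/1 up front keeps the later metric calls (precision, recall, ROC-AUC) unambiguous about which class counts as positive.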

Project Workflow

  1. Data Loading & Inspection
  2. Data Cleaning: handling missing values and duplicates
  3. Exploratory Data Analysis (EDA)
  4. Feature Engineering & Encoding
  5. Train / Test Split & Scaling
  6. Model Training (multiple algorithms)
  7. Hyperparameter Tuning
  8. Model Evaluation & Comparison
  9. Model Persistence (.pkl)
  10. Conclusion & Insights
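
Steps 4 to 6 can be sketched as a single scikit-learn pipeline. This is a minimal sketch on synthetic data, not the notebook's code: the three columns, the made-up target rule, and the choice of Logistic Regression are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the cleaned Adult data (values are made up).
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours-per-week": rng.integers(10, 60, n),
    "workclass": rng.choice(["Private", "State-gov", "Self-emp"], n),
})
y = (X["age"] + X["hours-per-week"] + rng.normal(0, 10, n) > 85).astype(int)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
])
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
```

Keeping the scaler and encoder inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into preprocessing.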

🛠 Tech Stack

| Category | Tools |
| --- | --- |
| Language | Python 3.9+ |
| Data Handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Machine Learning | scikit-learn |
| Model Persistence | joblib |
| Environment | Jupyter Notebook |

📁 Project Structure

salary-prediction/
│
├── data/
│   └── adult.csv
│
├── notebooks/
│   └── salary_prediction.ipynb
│
├── models/
│   └── best_model.pkl
│
├── images/
│   └── eda_plots/
│       ├── correlation_heatmap.png
│       ├── feature_importance.png
│       ├── confusion_matrix.png
│       └── roc_curve.png
│
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore

⚙️ Installation

1. Clone the repository

git clone https://github.com/ParvathyM155/Salary_Prediction_Using_Machine_Learning.git
cd Salary_Prediction_Using_Machine_Learning

2. Create and activate a virtual environment

python -m venv venv
source venv/bin/activate        # macOS / Linux
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

▢️ Usage

Launch the notebook:

jupyter notebook notebooks/salary_prediction.ipynb

Or load the saved model directly in Python:

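import joblib

# Load the persisted model (path as shown in the project structure)
model = joblib.load("models/best_model.pkl")
# new_data must be prepared the same way as the training features
prediction = model.predict(new_data)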
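
The snippet above leaves `new_data` undefined. The following sketch shows one way the save/load round trip could look end to end. It trains a tiny stand-in pipeline on made-up rows and writes it to a temporary directory; the two feature columns and the model choice are assumptions, and the real best_model.pkl may expect different inputs.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up training set using two of the README's numeric features.
train = pd.DataFrame({
    "age": [25, 38, 52, 46, 30, 60],
    "hours-per-week": [20, 40, 60, 50, 35, 45],
})
labels = [0, 0, 1, 1, 0, 1]
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(train, labels)

# Persist and reload, mirroring the models/best_model.pkl step.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "best_model.pkl")
    joblib.dump(pipeline, path)
    model = joblib.load(path)

# new_data must use the same column names and order as the training frame.
new_data = pd.DataFrame({"age": [41], "hours-per-week": [45]})
prediction = model.predict(new_data)
probability = model.predict_proba(new_data)[:, 1]  # P(income > 50K)
```

Saving the whole pipeline, rather than the bare estimator, keeps the preprocessing attached to the model so callers only need to supply raw feature columns.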

Exploratory Data Analysis

Key insights uncovered during EDA:

  • Strong correlation between education level and income.
  • Hours-per-week and age significantly affect earning probability.
  • Marital status and occupation are powerful categorical predictors.
  • The dataset is imbalanced (~76% <=50K, ~24% >50K).
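
The imbalance matters when reading accuracy figures. As a small worked example using only the approximate class shares quoted above, a classifier that always predicts the majority class already reaches about 76% accuracy:

```python
# Approximate class shares quoted in the EDA notes above.
class_share = {"<=50K": 0.76, ">50K": 0.24}

# A classifier that always predicts the majority class is correct
# exactly as often as that class occurs.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

# It never predicts ">50K", so its recall on that class is zero.
baseline_minority_recall = 0.0

print(majority_class, baseline_accuracy)
```

Any reported accuracy is therefore best compared against this floor, and metrics such as recall, F1, and ROC-AUC give a clearer picture of performance on the >50K class.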

🔥 Correlation Heatmap

Visualizes pairwise relationships between numerical features and the target variable.

[Figure: Correlation Heatmap]
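
A correlation matrix of this kind can be computed with pandas before plotting. The sketch below uses synthetic data with a made-up link between `education-num` and a binary target named `income_gt_50k` (a hypothetical column name); it is not the notebook's code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
education_num = rng.integers(1, 17, n)
# Synthetic target loosely tied to education-num (made-up relationship).
income_gt_50k = (education_num + rng.normal(0, 4, n) > 11).astype(int)

df = pd.DataFrame({
    "education-num": education_num,
    "hours-per-week": rng.integers(10, 80, n),
    "income_gt_50k": income_gt_50k,
})

corr = df.corr()  # Pearson correlation by default
# seaborn.heatmap(corr, annot=True) would render this matrix as a heatmap.
```

Pearson correlation only captures linear relationships between numeric columns, so categorical features such as occupation need encoding or a different association measure.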


🤖 Modeling

The following classifiers were trained and compared:

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • Gradient Boosting
  • K-Nearest Neighbors
  • Support Vector Machine

Hyperparameter tuning was performed using GridSearchCV with cross-validation.
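
A minimal sketch of such a search is shown below. The parameter grid, the 3-fold setting, the `roc_auc` scoring choice, and the synthetic data are all assumptions for illustration; the grid actually used in the notebook is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the Adult data (roughly 76% / 24%).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.76, 0.24], random_state=0,
)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # threshold-independent, suits imbalanced classes
    cv=3,
)
search.fit(X, y)

best_params = search.best_params_
best_cv_auc = search.best_score_  # mean cross-validated ROC-AUC
```

After fitting, `search.best_estimator_` holds a model refitted on all of `X` with the winning parameters, which is the object one would typically persist.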

🌟 Feature Importance

Top features driving the Gradient Boosting model's predictions.

[Figure: Feature Importance]
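
For tree ensembles in scikit-learn, a ranking like this is usually read from the fitted model's `feature_importances_` attribute. The sketch below assumes synthetic data whose target depends on three named columns and ignores a fourth; the names are borrowed from this README for readability only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
feature_names = ["education-num", "age", "hours-per-week", "noise"]
X = rng.normal(size=(400, 4))
# Made-up target that depends on the first three columns only.
y = (2.0 * X[:, 0] + X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name:15s} {score:.3f}")
```

Impurity-based importances can favor high-cardinality features, so permutation importance on held-out data is a common cross-check.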


πŸ† Results

Model Accuracy Precision Recall F1-Score ROC-AUC
Logistic Regression 0.85 0.74 0.60 0.66 0.90
Random Forest 0.86 0.76 0.63 0.69 0.91
Gradient Boosting 0.87 0.79 0.65 0.71 0.92

βœ… Gradient Boosting achieved the best overall performance and was selected as the final model.
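
The F1 column follows from precision and recall as their harmonic mean, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes it from the values in the table:

```python
# (precision, recall, reported F1) copied from the results table.
results = {
    "Logistic Regression": (0.74, 0.60, 0.66),
    "Random Forest": (0.76, 0.63, 0.69),
    "Gradient Boosting": (0.79, 0.65, 0.71),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {
    name: round(f1_score(p, r), 2) for name, (p, r, _) in results.items()
}
reported = {name: f1 for name, (_, _, f1) in results.items()}
print(recomputed == reported)
```

The recomputed values match the reported column to two decimal places.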


📈 Model Evaluation

Evaluation techniques applied:

  • Confusion Matrix
  • Classification Report
  • ROC Curve & AUC
  • Precision–Recall Curve
  • K-Fold Cross-Validation
  • Feature Importance Analysis
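
As one example from this list, K-fold cross-validation can be sketched with `cross_val_score`. The synthetic data, the 5-fold stratified setup, and the Logistic Regression estimator are assumptions; the notebook's configuration may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the Adult data.
X, y = make_classification(
    n_samples=400, n_features=6, weights=[0.76, 0.24], random_state=0
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

mean_auc = auc_scores.mean()
std_auc = auc_scores.std()
print(f"ROC-AUC: {mean_auc:.3f} +/- {std_auc:.3f}")
```

Reporting the spread across folds alongside the mean gives a sense of how stable the score is, which a single train/test split cannot show.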

🧩 Confusion Matrix

Breakdown of correct vs incorrect predictions for each income class.

[Figure: Confusion Matrix]
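
To make the link between the matrix and the headline metrics concrete, the sketch below builds a confusion matrix from ten hand-written labels (illustrative values, not project results) and derives precision and recall from its four cells.

```python
from sklearn.metrics import confusion_matrix

# Hand-written labels: 0 means <=50K, 1 means >50K (illustrative only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted >50K that are correct
recall = tp / (tp + fn)     # share of actual >50K that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

In this toy case accuracy is 0.7 while recall is only 0.5, which illustrates how accuracy can look acceptable even when half of the positive class is missed.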

📉 ROC Curve

Trade-off between true-positive and false-positive rates across thresholds.

[Figure: ROC Curve]
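
The points on such a curve come from sweeping the decision threshold over predicted scores. A minimal sketch with four hand-written scores (illustrative values only) shows the calls involved:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Four illustrative examples: true labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve: the probability that a random positive
# is scored above a random negative.
auc = roc_auc_score(y_true, y_score)
```

Here three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75. Plotting `fpr` against `tpr` with matplotlib would reproduce the curve itself.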


✅ Conclusion

The final Gradient Boosting model predicts income brackets with ~87% accuracy and a 0.92 ROC-AUC. Education, age, hours-per-week, and capital gain emerged as the most influential features, in line with real-world economic intuition.


🚀 Future Improvements

  • Address class imbalance using SMOTE or class weighting
  • Experiment with XGBoost and LightGBM
  • Deploy the model as a Streamlit or Flask web app
  • Add MLflow for experiment tracking
  • Build a CI pipeline with GitHub Actions
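
The first item, class weighting, is available in scikit-learn through the `class_weight` parameter. The sketch below compares an unweighted and a balanced Logistic Regression on synthetic imbalanced data; the dataset and estimator are assumptions, and no particular improvement is guaranteed on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the Adult data (roughly 76% / 24%).
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.76, 0.24],
    class_sep=0.8, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# "balanced" reweights samples inversely to class frequency.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"recall plain={recall_plain:.2f} weighted={recall_weighted:.2f}")
```

Class weighting typically raises recall on the minority class at some cost in precision, so both metrics should be checked before adopting it.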

👤 Author

Parvathy M


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.


⭐ If you found this project helpful, please consider giving it a star!
