# Salary Prediction Using Machine Learning

Binary classification project predicting whether an individual's annual income exceeds $50K based on demographic and employment attributes from the UCI Adult Census dataset.
## Table of Contents

- Overview
- Dataset
- Project Workflow
- Tech Stack
- Project Structure
- Installation
- Usage
- Exploratory Data Analysis
- Modeling
- Results
- Model Evaluation
- Conclusion
- Future Improvements
- Author
- License
## Overview

This project applies supervised machine learning to predict whether a person earns more than $50K per year using the well-known UCI Adult Census Income dataset. The pipeline covers data cleaning, exploratory data analysis (EDA), feature engineering, model training, hyperparameter tuning, and evaluation, all packaged in a reproducible workflow suitable for portfolio and internship presentation.
## Dataset

- Source: UCI Machine Learning Repository – Adult Dataset
- Records: ~48,842 instances
- Features: 14 demographic and employment attributes
- Target: `income` – `<=50K` or `>50K`

### Key Features

`age`, `workclass`, `education`, `education-num`, `marital-status`, `occupation`, `relationship`, `race`, `sex`, `capital-gain`, `capital-loss`, `hours-per-week`, `native-country`
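The raw UCI files can be loaded along these lines. Note two assumptions stated here rather than taken from this README: the raw files have no header row and use `?` for missing values (per the UCI `adult.names` description), and they include an extra `fnlwgt` weighting column alongside the features listed above. The sample rows below are inlined so the sketch is self-contained.

```python
import io

import pandas as pd

# Column names following the UCI "adult.names" description
# (14 attributes plus the income target).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Two rows in the raw file format, inlined for illustration.
SAMPLE = """\
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
52, Self-emp-not-inc, 209642, HS-grad, 9, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 45, United-States, >50K
"""

# In the project this would read data/adult.csv instead of the inline sample.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    names=COLUMNS,
    skipinitialspace=True,  # the raw file puts a space after each comma
    na_values="?",          # treat "?" as a missing value
)
print(df.shape)  # (2, 15)
```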
## Project Workflow

- Data Loading & Inspection
- Data Cleaning – handling missing values & duplicates
- Exploratory Data Analysis (EDA)
- Feature Engineering & Encoding
- Train / Test Split & Scaling
- Model Training (multiple algorithms)
- Hyperparameter Tuning
- Model Evaluation & Comparison
- Model Persistence (`.pkl`)
- Conclusion & Insights
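The workflow above can be sketched end-to-end with a scikit-learn `Pipeline`. This is a minimal illustration, not the project's actual code: the data here is synthetic, the column subset and preprocessing choices are assumptions, and only the final Gradient Boosting model is shown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned Adult data (columns are illustrative).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "hours-per-week": rng.integers(20, 60, n),
    "education": rng.choice(["HS-grad", "Bachelors", "Masters"], n),
    "sex": rng.choice(["Male", "Female"], n),
    "income": rng.choice(["<=50K", ">50K"], n, p=[0.76, 0.24]),
})

X = df.drop(columns="income")
y = (df["income"] == ">50K").astype(int)  # encode the target as 0/1

# Scale numeric features, one-hot encode categoricals.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "hours-per-week"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["education", "sex"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", GradientBoostingClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")

# Persisting the fitted pipeline, as the project does:
# joblib.dump(model, "models/best_model.pkl")
```

Bundling preprocessing and the classifier in one `Pipeline` means the saved `.pkl` can be applied directly to raw feature rows later.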
## Tech Stack

| Category | Tools |
|---|---|
| Language | Python 3.9+ |
| Data Handling | pandas, numpy |
| Visualization | matplotlib, seaborn |
| Machine Learning | scikit-learn |
| Model Persistence | joblib |
| Environment | Jupyter Notebook |
## Project Structure

```
salary-prediction/
│
├── data/
│   └── adult.csv
│
├── notebooks/
│   └── salary_prediction.ipynb
│
├── models/
│   └── best_model.pkl
│
├── images/
│   └── eda_plots/
│       ├── correlation_heatmap.png
│       ├── feature_importance.png
│       ├── confusion_matrix.png
│       └── roc_curve.png
│
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore
```
## Installation

1. Clone the repository

```bash
git clone https://github.com/ParvathyM155/Salary_Prediction_Using_Machine_Learning.git
cd Salary_Prediction_Using_Machine_Learning
```

2. Create and activate a virtual environment

```bash
python -m venv venv
source venv/bin/activate   # macOS / Linux
venv\Scripts\activate      # Windows
```

3. Install dependencies

```bash
pip install -r requirements.txt
```

## Usage

Launch the notebook:

```bash
jupyter notebook notebooks/salary_prediction.ipynb
```

Or load the saved model directly in Python:

```python
import joblib

model = joblib.load("models/best_model.pkl")
prediction = model.predict(new_data)
```

## Exploratory Data Analysis

Key insights uncovered during EDA:
- Strong correlation between education level and income.
- Hours-per-week and age significantly affect earning probability.
- Marital status and occupation are powerful categorical predictors.
- The dataset is imbalanced (~76% `<=50K`, ~24% `>50K`).
*Pair plot: pairwise relationships between numerical features and the target variable.*
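The class imbalance noted above is easy to verify with pandas. A minimal sketch on a stand-in `Series` (the project would call this on the `income` column of the full dataset):

```python
import pandas as pd

# Stand-in for df["income"], built to match the reported ~76% / ~24% split.
income = pd.Series(["<=50K"] * 76 + [">50K"] * 24, name="income")

# normalize=True returns proportions instead of raw counts.
balance = income.value_counts(normalize=True)
print(balance["<=50K"], balance[">50K"])  # 0.76 0.24
```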
## Modeling

The following classifiers were trained and compared:
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting
- K-Nearest Neighbors
- Support Vector Machine
Hyperparameter tuning was performed using GridSearchCV with cross-validation.
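A hedged sketch of such a grid search follows; the data is synthetic and the grid values are illustrative, not the project's actual search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the encoded training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Illustrative grid; the real search space is defined in the notebook.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=3,               # 3-fold cross-validation per candidate
    scoring="roc_auc",  # rank candidates by ROC-AUC
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
print(f"best CV ROC-AUC: {search.best_score_:.3f}")
```

`search.best_estimator_` is the refitted model on the full data and can be persisted with joblib as in the workflow above.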
*Top features driving the Gradient Boosting model's predictions.*
## Results

| Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.85 | 0.74 | 0.60 | 0.66 | 0.90 |
| Random Forest | 0.86 | 0.76 | 0.63 | 0.69 | 0.91 |
| Gradient Boosting | 0.87 | 0.79 | 0.65 | 0.71 | 0.92 |
**Gradient Boosting achieved the best overall performance and was selected as the final model.**
## Model Evaluation

Evaluation techniques applied:
- Confusion Matrix
- Classification Report
- ROC Curve & AUC
- Precision–Recall Curve
- K-Fold Cross-Validation
- Feature Importance Analysis
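These techniques map directly onto scikit-learn calls. A self-contained sketch with synthetic data standing in for the Adult features (the plotting of the ROC and precision–recall curves is omitted):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import cross_val_score, train_test_split

# Imbalanced synthetic data (~76% negative class, mirroring the dataset).
X, y = make_classification(n_samples=400, weights=[0.76], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # P(class 1), needed for ROC-AUC

print(confusion_matrix(y_test, y_pred))       # correct vs incorrect per class
print(classification_report(y_test, y_pred))  # precision / recall / F1
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}")

# K-fold cross-validation on the full data.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```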
*Confusion matrix: breakdown of correct vs incorrect predictions for each income class.*

*ROC curve: trade-off between true-positive and false-positive rates across thresholds.*
## Conclusion

The final Gradient Boosting model reliably predicts income brackets with ~87% accuracy and a 0.92 ROC-AUC, demonstrating strong generalization. Education, age, hours-per-week, and capital gain emerged as the most influential features, aligning with real-world economic intuition.
## Future Improvements

- Address class imbalance using SMOTE or class weighting
- Experiment with XGBoost and LightGBM
- Deploy the model as a Streamlit or Flask web app
- Add MLflow for experiment tracking
- Build a CI pipeline with GitHub Actions
## Author

**Parvathy M**

- Portfolio: yourwebsite.com
- LinkedIn: linkedin.com/in/parvathym155
- GitHub: @ParvathyM155
## License

This project is licensed under the MIT License – see the LICENSE file for details.

⭐ If you found this project helpful, please consider giving it a star!



