This repository contains my submission for Task 1 of the AI & ML Internship, which focuses on data cleaning and preprocessing using the Titanic dataset.
The objective is to clean and prepare the raw Titanic dataset for machine learning models. This includes:
- Handling missing data
- Encoding categorical features
- Scaling numerical values
- Outlier detection and removal
- Python 🐍
- Pandas & NumPy 📊
- Matplotlib & Seaborn 📈
- Scikit-learn 🔧
- Titanic Dataset on Kaggle
- File used: `titanic.csv`
- Checked data types, null values, and structure of the dataset.
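A minimal sketch of this inspection step. The small DataFrame is a toy stand-in for `titanic.csv` (its values are illustrative, not taken from the real file), so the snippet runs on its own.

```python
import pandas as pd

# Toy stand-in for titanic.csv; values are illustrative only.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

print(df.dtypes)                 # data type of each column
null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)
print(df.shape)                  # (rows, columns)
```

On the real file, the same calls would follow `df = pd.read_csv("titanic.csv")`.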
- Filled `Age` with median.
- Filled `Embarked` with mode.
- Dropped `Cabin` due to excessive missing values.
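The imputation steps above could look roughly like this in pandas. The DataFrame is a toy example with made-up values; only the column names come from the Titanic dataset.

```python
import pandas as pd

# Toy data with gaps in the same columns as the Titanic dataset.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# Numeric column: fill gaps with the median, which is robust to outliers.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Categorical column: fill gaps with the most frequent value (mode).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Mostly empty column: drop it rather than impute.
df = df.drop(columns=["Cabin"])

print(df)
```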
- Encoded `Sex` using binary mapping (male=0, female=1).
- Applied One-Hot Encoding to `Embarked`.
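A sketch of the two encodings on toy data. The use of `drop_first=True` is an assumption, inferred from the sample output further down, which shows only `Embarked_Q` and `Embarked_S` columns.

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Binary mapping for a two-category column.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encode Embarked. drop_first=True (an assumption) drops the
# first category, Embarked_C, leaving Embarked_Q and Embarked_S.
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)

print(df)
```

Dropping one category avoids a redundant column, since a row with both remaining flags set to false must belong to the dropped category.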
- Standardized `Age` and `Fare` using `StandardScaler`.
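A short sketch of the scaling step with scikit-learn's `StandardScaler`, again on toy values rather than the real data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Rescale each column to zero mean and unit variance.
scaler = StandardScaler()
df[["Age", "Fare"]] = scaler.fit_transform(df[["Age", "Fare"]])

print(df)
```

When a train/test split is involved, fitting the scaler on the training portion only and reusing it on the test portion keeps test statistics from leaking into training.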
- Plotted boxplots for `Age` and `Fare`.
- Removed outliers from `Fare` using the IQR method.
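The IQR filter can be sketched as follows. The fare values are toy data, and the 1.5 multiplier is the conventional boxplot whisker rule, assumed here because the exact threshold is not stated above.

```python
import pandas as pd

# Toy fares with one extreme value; illustrative only.
df = pd.DataFrame({"Fare": [7.25, 7.92, 8.05, 8.46, 13.00, 26.55, 512.33]})

# Interquartile range of the column.
q1 = df["Fare"].quantile(0.25)
q3 = df["Fare"].quantile(0.75)
iqr = q3 - q1

# Conventional fences at 1.5 * IQR beyond the quartiles (assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows that fall inside the fences.
filtered = df[(df["Fare"] >= lower) & (df["Fare"] <= upper)]

print(f"Removed {len(df) - len(filtered)} outlier row(s)")
```

Filtering rows this way leaves gaps in the index, which is why the sample output below skips index 1.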
- Real-world datasets are messy, so preprocessing is a critical step before model training.
- Learned various techniques like median/mode imputation, encoding, and standardization.
- Visualized and cleaned outliers for better model input.
- Types of missing data: MCAR, MAR, MNAR
- Handling categorical variables: Label and One-Hot Encoding
- Normalization vs Standardization:
- Normalization → scales to [0, 1]
- Standardization → zero mean, unit variance
- Outlier detection: Boxplot, IQR
- Importance of preprocessing: Ensures model quality, avoids bias
- Data imbalance handling: SMOTE, Resampling, Class weights
- Encoding methods: One-Hot (independent binary columns), Label (ordinal values)
- Effect on model accuracy: Strong impact; cleaner, consistently scaled inputs generally lead to better models
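The normalization versus standardization distinction listed above can be checked numerically. This is a small NumPy sketch on an arbitrary toy array, not part of the original notebook.

```python
import numpy as np

# Arbitrary toy values.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Normalization (min-max scaling): maps the values onto [0, 1].
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance.
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
```

Normalization is bounded by the observed minimum and maximum, so a single extreme value compresses everything else. Standardization has no fixed range but is less dominated by one extreme point.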
Here’s a sample of the cleaned dataset:
Data Preprocessing Completed

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 0 | -0.565736 | 1 | 0 | A/5 21171 | -0.502445 | False | True |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 1 | -0.258337 | 0 | 0 | STON/O2. 3101282 | -0.488854 | False | True |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 0.433312 | 1 | 0 | 113803 | 0.420730 | False | True |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 0 | 0.433312 | 0 | 0 | 373450 | -0.486337 | False | True |
| 5 | 6 | 0 | 3 | Moran, Mr. James | 0 | -0.104637 | 0 | 0 | 330877 | -0.478116 | True | False |