🧼 Task 1: Data Cleaning & Preprocessing - Titanic Dataset

This repository contains my submission for Task 1 of the AI & ML Internship, which focuses on data cleaning and preprocessing using the Titanic dataset.

📌 Objective

To clean and prepare the raw Titanic dataset for machine learning models. This includes:

Handling missing data
Encoding categorical features
Scaling numerical values
Outlier detection and removal

🛠️ Tools & Technologies Used

Python 🐍
Pandas & NumPy 📊
Matplotlib & Seaborn 📈
Scikit-learn 🔧

📁 Dataset

Titanic Dataset on Kaggle
File used: titanic.csv

🔍 Steps Performed

1. Imported and Explored Data

Checked data types, null values, and structure of the dataset.
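
A minimal sketch of this exploration step. The small DataFrame below is a hypothetical stand-in so the snippet runs on its own; with the real file, it would be loaded with `pd.read_csv("titanic.csv")` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows (illustration only, not the real dataset).
# With the real file: df = pd.read_csv("titanic.csv")
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # data type of each column
df.info()          # dtypes and non-null counts in one view

# Missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```

Sorting the null counts makes it easy to see which columns need imputation and which are too sparse to keep.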

2. Handled Missing Values

Filled Age with median.
Filled Embarked with mode.
Dropped Cabin due to excessive missing values.
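
The three steps above can be sketched as follows. The sample rows are hypothetical; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kinds of gaps as the real data
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# The median is less sensitive to extreme ages than the mean
df["Age"] = df["Age"].fillna(df["Age"].median())

# mode() returns a Series because ties are possible, so take the first entry
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Too many missing values to impute sensibly, so drop the column
df = df.drop(columns=["Cabin"])

print(df)
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification, which can behave unexpectedly on chained selections in pandas.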

3. Categorical Encoding

Encoded Sex using binary mapping (male=0, female=1).
Applied One-Hot Encoding to Embarked, dropping the first category (C), which leaves the columns Embarked_Q and Embarked_S.
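
A minimal sketch of both encodings on hypothetical rows. The use of `drop_first=True` is an assumption inferred from the final output, which contains only Embarked_Q and Embarked_S; the notebook may have produced those columns another way.

```python
import pandas as pd

# Hypothetical sample rows
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Binary mapping for a two-valued feature
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# One-hot encode Embarked; drop_first=True (assumed) removes the first
# category in sorted order (C), so C is represented by Q and S both being false
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)

print(df)
```

Dropping one category avoids a redundant column, since the dropped value is implied when all remaining indicator columns are false.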

4. Feature Scaling

Standardized Age and Fare using StandardScaler.
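
A minimal sketch of the scaling step on hypothetical rows, using scikit-learn's `StandardScaler`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# fit_transform learns each column's mean and standard deviation,
# then rescales the column to zero mean and unit variance
scaler = StandardScaler()
df[["Age", "Fare"]] = scaler.fit_transform(df[["Age", "Fare"]])

print(df)
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), so `df["Age"].std()` with pandas' default `ddof=1` comes out slightly above 1. When the data is later split for modeling, the scaler should be fitted on the training split only and then applied to the test split.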

5. Outlier Detection and Removal

Plotted boxplots for Age and Fare.
Removed outliers from Fare using the IQR method.
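
A minimal sketch of the IQR filter on hypothetical fares. The 1.5 × IQR fences are the conventional choice and an assumption here, since the multiplier used in the notebook is not stated above.

```python
import pandas as pd

# Hypothetical fares with one extreme value
df = pd.DataFrame({
    "Fare": [7.25, 7.925, 8.05, 8.4583, 13.0, 21.075, 30.0, 512.3292],
})

q1 = df["Fare"].quantile(0.25)
q3 = df["Fare"].quantile(0.75)
iqr = q3 - q1

# Conventional fences (assumed): 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows whose Fare lies inside the fences (inclusive)
df_clean = df[df["Fare"].between(lower, upper)]

print(f"Removed {len(df) - len(df_clean)} of {len(df)} rows")
```

Filtering rows this way keeps the original index labels, which is why the cleaned dataset can show gaps in its index.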

📊 Outlier Boxplot

![Boxplot for Age and Fare]

🧠 Key Learnings

Real-world datasets are messy, so preprocessing is a critical step before model training.
Practiced median/mode imputation, categorical encoding, and standardization.
Visualized outliers with boxplots and removed them with the IQR method to give the model cleaner input.

❓ Sample Interview Questions Answered

Types of missing data: MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random)
Handling categorical variables: Label and One-Hot Encoding
Normalization vs Standardization:
- Normalization → rescales values to the range [0, 1] (min-max scaling)
- Standardization → rescales values to zero mean and unit variance
Outlier detection: Boxplot, IQR
Importance of preprocessing: Missing values, unscaled features, and outliers can degrade model quality or bias results if left untreated
Data imbalance handling: SMOTE, Resampling, Class weights
Encoding methods: One-Hot (one binary column per category, no implied order), Label (one integer code per category, which implies an order)
Effect on model accuracy: Strong impact; better preprocessing generally leads to better models
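
To make the normalization vs standardization answer concrete, here is a small worked example on a hypothetical array, written with plain NumPy rather than a scaler class.

```python
import numpy as np

# Hypothetical feature values
x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: maps the smallest value to 0 and the largest to 1
normalized = (x - x.min()) / (x.max() - x.min())

# Standardization: subtract the mean, divide by the (population) standard deviation
standardized = (x - x.mean()) / x.std()

print(normalized)
print(standardized)
```

Normalization fixes the output range but is sensitive to extreme values, while standardization fixes the mean and spread without bounding the range.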

✅ Final Output Snapshot

Here’s a sample of the cleaned dataset:

Data Preprocessing Completed

   PassengerId  Survived  Pclass  \
0            1         0       3
2            3         1       3
3            4         1       1
4            5         0       3
5            6         0       3

                                           Name  Sex       Age  SibSp  Parch  \
0                       Braund, Mr. Owen Harris    0 -0.565736      1      0
2                        Heikkinen, Miss. Laina    1 -0.258337      0      0
3  Futrelle, Mrs. Jacques Heath (Lily May Peel)    1  0.433312      1      0
4                      Allen, Mr. William Henry    0  0.433312      0      0
5                              Moran, Mr. James    0 -0.104637      0      0

             Ticket      Fare  Embarked_Q  Embarked_S
0         A/5 21171 -0.502445       False        True
2  STON/O2. 3101282 -0.488854       False        True
3            113803  0.420730       False        True
4            373450 -0.486337       False        True
5            330877 -0.478116        True       False
