diff --git a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/dimensionality-reduction.mdx b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/dimensionality-reduction.mdx index e69de29..972626e 100644 --- a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/dimensionality-reduction.mdx +++ b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/dimensionality-reduction.mdx @@ -0,0 +1,99 @@ +--- +title: "Dimensionality Reduction: PCA & LDA" +sidebar_label: Dimensionality Reduction +description: "Reducing feature complexity while preserving information: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA)." +tags: [data-science, dimensionality-reduction, pca, lda, feature-selection, unsupervised-learning] +--- + +In Machine Learning, more data isn't always better. The **Curse of Dimensionality** refers to the phenomenon where, as the number of features (dimensions) increases, the volume of the space increases so fast that the available data becomes sparse. This leads to overfitting and massive computational costs. + +Dimensionality Reduction aims to project high-dimensional data into a lower-dimensional space while retaining as much meaningful information as possible. + +## 1. Why Reduce Dimensions? + +1. **Visualization:** We cannot visualize data in 10 dimensions. Reducing it to 2D or 3D allows us to see clusters and patterns. +2. **Performance:** Fewer features mean faster training and lower memory usage. +3. **Noise Reduction:** By removing "redundant" features, we help the model focus on the most important signals. +4. **Multicollinearity:** It helps handle features that are highly correlated with each other. + +## 2. Principal Component Analysis (PCA) + +PCA is an **unsupervised** technique that finds the directions (Principal Components) where the variance of the data is maximized. 
+ +* **Principal Component 1 (PC1):** The direction that captures the most spread in the data. +* **Principal Component 2 (PC2):** The direction perpendicular to PC1 that captures the next most spread. + +**Key Concept: Explained Variance** +In PCA, we often look at the "Scree Plot" (the explained variance of each component, in order) to decide how many dimensions to keep. A common rule of thumb is to keep enough components to explain about **95%** of the total variance. + +$$ +Var(PC_1) \geq Var(PC_2) \geq \dots \geq Var(PC_n) +$$ + +## 3. Linear Discriminant Analysis (LDA) + +While PCA cares about *variance*, LDA is a **supervised** technique that cares about **separability**. + +* **Goal:** Project data onto a new axis that maximizes the distance between the means of different classes and minimizes the variance within each class. +* **Usage:** Often used as a preprocessing step for classification tasks. + +## 4. PCA vs. LDA: A Comparison + +| Feature | PCA | LDA | +| :--- | :--- | :--- | +| **Type** | Unsupervised (Ignores labels) | Supervised (Uses labels) | +| **Objective** | Maximize variance | Maximize class separability | +| **Application** | Feature compression, visualization | Preprocessing for classification | +| **Limit** | Max components = min(samples, features) | Max components = min(classes - 1, features) | + +```mermaid +graph LR + subgraph Goal_PCA [PCA Objective] + V[Max Variance] + end + subgraph Goal_LDA [LDA Objective] + S[Max Class Separation] + end + Data[High Dimensional Data] --> PCA + Data --> LDA + PCA --> Goal_PCA + LDA --> Goal_LDA + +``` + +## 5. Implementation with Scikit-Learn + +```python +from sklearn.decomposition import PCA +from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA + +# Assumes X_scaled (standardized feature matrix) and y (class labels) are already defined + +# 1. PCA: Reducing to 2 dimensions +pca = PCA(n_components=2) +X_pca = pca.fit_transform(X_scaled) +print(f"Explained Variance: {pca.explained_variance_ratio_}") + +# 2. 
LDA: Reducing based on target 'y' +lda = LDA(n_components=1) +X_lda = lda.fit_transform(X_scaled, y) + +``` + +:::warning Critical Note +Always perform **Feature Scaling** (Standardization) before applying PCA. Because PCA maximizes variance, a feature with a large scale (like 'Salary') will dominate the components even if it isn't the most important. +::: + +## 6. Other Notable Techniques + +* **t-SNE (t-Distributed Stochastic Neighbor Embedding):** Excellent for 2D/3D visualization of non-linear clusters. +* **UMAP (Uniform Manifold Approximation and Projection):** Faster and often preserves more global structure than t-SNE. +* **Autoencoders:** A type of Neural Network used to learn "bottleneck" representations of data. + +## References for More Details + +* **[StatQuest - PCA Clearly Explained](https://www.youtube.com/watch?v=FgakZw6K1QQ):** Visual learners wanting to understand the intuition behind the math. + +* **[Scikit-Learn - Decomposition Module](https://scikit-learn.org/stable/modules/decomposition.html):** Technical documentation on PCA, Factor Analysis, and Dictionary Learning. + +--- + +**You have now completed the Data Engineering and Preprocessing journey! You have learned how to collect data, clean it, engineer features, and compress them. 
You are finally ready to build and train your first Machine Learning model.** \ No newline at end of file diff --git a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-engineering.mdx b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-engineering.mdx index e69de29..e3736c8 100644 --- a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-engineering.mdx +++ b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-engineering.mdx @@ -0,0 +1,96 @@ +--- +title: The Art of Feature Engineering +sidebar_label: Feature Engineering +description: "A comprehensive guide to creating, transforming, and selecting features to maximize Machine Learning model performance." +tags: [feature-engineering, data-science, preprocessing, python, pandas] +--- + +:::note +"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — **Andrew Ng** +::: + +Feature engineering is the process of using domain knowledge to extract new variables from raw data that help machine learning algorithms learn faster and predict more accurately. + +## 1. Transforming Numerical Features + +Numerical data often needs to be reshaped to satisfy the mathematical assumptions of algorithms like Linear Regression or Neural Networks. + +### A. Scaling (Normalization & Standardization) +Most models are sensitive to the magnitude of numbers. If one feature is "Salary" ($50,000$) and another is "Age" ($25$), the model might think Salary is $2,000$ times more important simply because the numbers are larger. + +* **Standardization (Z-score):** Centers data at $\mu = 0$ with $\sigma = 1$. +* **Normalization (Min-Max):** Rescales data to a fixed range, usually $[0, 1]$. + +### B. Binning (Discretization) +Sometimes the exact value isn't as important as the "group" it belongs to. 
+* **Example:** Converting "Age" into "Child," "Teen," "Adult," and "Senior." +* **Why?** It can help handle outliers and capture non-linear relationships. + +## 2. Encoding Categorical Features + +Machine Learning models are mathematical equations; they cannot multiply a weight by "London" or "Paris." We must convert text into numbers. + +### A. One-Hot Encoding +Creates a new binary column ($0$ or $1$) for every unique category. +* **Best for:** Nominal data (no inherent order, like "Color" or "City"). + +### B. Ordinal Encoding +Assigns an integer to each category based on rank. +* **Best for:** Ordinal data (where order matters, like "Low," "Medium," "High"). + +## 3. Creating New Features (Feature Construction) + +This is where domain expertise shines. You combine existing columns to create a more powerful "signal." + +* **Interaction Features:** If you have `Width` and `Length`, creating `Area = Width * Length` might be more predictive for housing prices. +* **Ratios:** In finance, `Debt-to-Income Ratio` is often more useful than having `Debt` and `Income` as separate features. +* **Polynomial Features:** Creating $x^2$ or $x^3$ to capture curved relationships in the data. + +```mermaid +graph LR + A[Feature A: Price] --> C{Logic} + B[Feature B: SqFt] --> C + C --> New[New Feature: Price_per_SqFt] + style New fill:#f3e5f5,stroke:#7b1fa2,color:#333 + +``` + +## 4. Handling DateTime Features + +Raw timestamps (e.g., `2023-10-27 14:30:00`) are useless to a model. We must extract the cyclical patterns: + +* **Time of Day:** Morning, Afternoon, Evening, Night. +* **Day of Week:** Is it a weekend? (Useful for retail/traffic prediction). +* **Seasonality:** Month or Quarter (Useful for sales forecasting). + +## 5. Text Feature Engineering (NLP Basics) + +To turn "Natural Language" into features, we use techniques like: + +1. **Bag of Words (BoW):** Counting the frequency of each word. +2. **TF-IDF:** Weighting words by how unique they are to a specific document. +3. 
**Word Embeddings:** Converting words into dense vectors that capture meaning (e.g., Word2Vec). + +## 6. Feature Selection: "Less is More" + +Having too many features can lead to the **Curse of Dimensionality**, causing the model to overfit on noise. + +* **Filter Methods:** Using statistical tests (like Correlation) to drop irrelevant features. +* **Wrapper Methods:** Training the model on different subsets of features to find the best combination (e.g., Recursive Feature Elimination). +* **Embedded Methods:** Models that perform feature selection during training (e.g., LASSO Regression uses L1 regularization to zero out useless weights). + +## 7. The Golden Rules of Feature Engineering + +1. **Don't Leak Information:** Never use the `Target` variable to create a feature (this is called Data Leakage). +2. **Think Cyclically:** For time or angles, use circular transforms (such as $\sin$ and $\cos$ encodings) so the model knows that the end of a cycle is close to its start (for example, that 23:00 is close to 00:00). +3. **Visualize First:** Use scatter plots to see if a feature actually correlates with your target before spending hours engineering it. + +## References for More Details + +* **[Feature Engineering for Machine Learning (Alice Zheng)](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/):** Deep mathematical intuition. + +* **[Scikit-Learn Preprocessing Module](https://scikit-learn.org/stable/modules/preprocessing.html):** Practical code implementation for scaling and encoding. 
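
The "Think Cyclically" rule above can be sketched in a few lines. This is a minimal illustration, not a prescribed recipe: the helper name `add_cyclical_features`, the `hour` column, and the period of 24 are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_cyclical_features(df: pd.DataFrame, column: str, period: float) -> pd.DataFrame:
    """Return a copy of df with sine/cosine encodings of a cyclical column.

    Hypothetical helper for illustration; `period` is the length of one
    full cycle (24 for hours, 7 for weekdays, 360 for degrees).
    """
    out = df.copy()
    angle = 2 * np.pi * out[column] / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


hours = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = add_cyclical_features(hours, "hour", period=24)


def distance(i: int, j: int) -> float:
    """Euclidean distance between rows i and j in (sin, cos) space."""
    a = encoded.loc[i, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    b = encoded.loc[j, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))


print(distance(0, 3))  # hour 0 vs hour 23: neighbors on the circle
print(distance(0, 2))  # hour 0 vs hour 12: opposite sides of the circle
```

With the raw integer, hour 23 looks as far from hour 0 as it can be; after the encoding, the two sit next to each other while hour 12 is the farthest point. Both the sine and cosine columns are needed, because either one alone maps two different hours to the same value.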
+ +--- + +**Now that your features are engineered and ready, we need to ensure the data is mathematically balanced so no single feature dominates the learning process.** \ No newline at end of file diff --git a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling.mdx b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling.mdx index e69de29..b021c42 100644 --- a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling.mdx +++ b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling.mdx @@ -0,0 +1,116 @@ +--- +title: "Feature Scaling: Normalization & Standardization" +sidebar_label: Feature Scaling +description: "Mastering the techniques used to harmonize feature scales, ensuring faster convergence and better model accuracy." +tags: [data-cleaning, preprocessing, scaling, normalization, standardization, machine-learning] +--- + +Imagine you are training a model to predict house prices. You have two features: +1. **Number of Bedrooms:** Range 1–5 +2. **Square Footage:** Range 500–5000 + +Because 500 is much larger than 5, a model might "think" square footage is 100 times more important than bedrooms. **Feature Scaling** levels the playing field so that the model treats all features fairly based on their information, not their magnitude. + +## 1. Why do we scale? + +Scaling is mandatory for specific types of algorithms: +* **Distance-Based Algorithms:** KNN, K-Means, and SVM rely on Euclidean distance. Larger scales distort these distances. +* **Gradient Descent-Based Algorithms:** Neural Networks and Logistic Regression converge (find the answer) much faster when the "loss landscape" is spherical rather than elongated. +* **Principal Component Analysis (PCA):** Features with higher variance will dominate the principal components. + +## 2. 
Standardization (Z-Score Normalization) + +Standardization transforms data so that it has a **mean of 0** and a **standard deviation of 1**. + +**Formula:** + +$$ +z = \frac{x - \mu}{\sigma} +$$ + +* **$\mu$:** Mean of the feature. +* **$\sigma$:** Standard deviation of the feature. + +**When to use:** Use this when your data roughly follows a **Gaussian (Normal) Distribution**. It is less sensitive to outliers than Min-Max scaling (although outliers still shift the mean and inflate the standard deviation) and is a common default for many ML algorithms (SVM, Linear Regression). + +## 3. Normalization (Min-Max Scaling) + +Normalization rescales the data into a fixed range, usually **[0, 1]**. + +**Formula:** + +$$ +x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} +$$ + +**When to use:** Use this when you do **not** know the distribution of your data or when you know there are no significant outliers. It is widely used in **Image Processing** (scaling pixel values from 0–255 to 0–1) and **Neural Networks**. + +:::warning +Min-Max scaling is very sensitive to outliers. A single outlier at 1,000,000 will "squash" all your normal data points into a tiny range near 0. +::: + +## 4. Robust Scaling + +If your dataset contains many outliers that you cannot remove, use the **Robust Scaler**. Instead of using the mean and standard deviation, it uses the **Median** and the **Interquartile Range (IQR)**. + +**Formula:** + +$$ +x_{robust} = \frac{x - \text{median}}{Q_3 - Q_1} +$$ + + +## 5. Comparison Table + +| Method | Range | Distribution | Outlier Sensitivity | +| :--- | :--- | :--- | :--- | +| **Standardization** | Unbounded (mostly within [-3, 3] for roughly Gaussian data) | Becomes $\mu=0, \sigma=1$ | Moderate | +| **Normalization** | [0, 1] or [-1, 1] | Squashed into range | **High** | +| **Robust Scaling** | Varies | Median centered at 0 | **Very Low** | + +## 6. Implementation with Scikit-Learn + +```python +from sklearn.preprocessing import StandardScaler, MinMaxScaler + +data = [[100, 0.001], [8, 0.05], [50, 0.005], [88, 0.07]] + +# 1. 
Standardization +std_scaler = StandardScaler() +std_data = std_scaler.fit_transform(data) + +# 2. Normalization +min_max = MinMaxScaler() +norm_data = min_max.fit_transform(data) + +``` + +## 7. The Golden Rule: Fit on Train, Transform on Test + +One of the most common mistakes in Data Engineering is "Data Leakage." When scaling, you must: + +1. **Fit** the scaler only on the **Training Set**. +2. **Transform** the **Test Set** using the parameters (such as $\mu$ and $\sigma$, or $x_{min}$ and $x_{max}$) learned from the Training Set. + +```mermaid +graph TD + Data[Full Dataset] --> Split{Split} + Split --> Train[Training Set] + Split --> Test[Test Set] + Train --> Fit[Scaler.fit] + Fit --> Trans1[Scaler.transform Train] + Fit --> Trans2[Scaler.transform Test] + style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333 + +``` + +## References for More Details + +* **[Scikit-Learn Preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling):** Implementation details and alternative scalers (MaxAbsScaler). + +* **[About Feature Scaling (Article)](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html):** A deep mathematical dive into why scaling matters for specific algorithms. + + +--- + +**Now that your features are cleaned, engineered, and scaled, you have a high-quality dataset. 
But before you train a model, you need to ensure you haven't given it too much information or too little.** \ No newline at end of file diff --git a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-selection.mdx b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-selection.mdx index e69de29..2ea0e90 100644 --- a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-selection.mdx +++ b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-selection.mdx @@ -0,0 +1,101 @@ +--- +title: "Feature Selection: Quality Over Quantity" +sidebar_label: Feature Selection +description: "Techniques for identifying and keeping only the most relevant features using filter, wrapper, and embedded methods." +tags: [data-science, feature-selection, machine-learning, statistics, overfitting] +--- + +**Feature Selection** is the process of reducing the number of input variables when developing a predictive model. Unlike [Dimensionality Reduction](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/dimensionality-reduction), which transforms features into a new space, Feature Selection keeps the original features but removes the ones that are redundant or irrelevant. + +## 1. Why Select Features? + +1. **Reduce Overfitting:** Less redundant data means fewer opportunities to make decisions based on noise. +2. **Improve Accuracy:** Removing misleading features can improve the model's predictive power. +3. **Reduce Training Time:** Fewer features mean less data to process and faster training. +4. **Enhanced Interpretability:** It is easier to explain a model with 5 key drivers than one with 500. + +## 2. 
The Three Families of Selection + +We categorize selection techniques into three main strategies: + +```mermaid +graph TD + FS[Feature Selection] --> Filter[Filter Methods] + FS --> Wrapper[Wrapper Methods] + FS --> Embedded[Embedded Methods] + + Filter --> F1[Correlation, Chi-Square, Mutual Info] + Wrapper --> W1[Forward/Backward Selection, RFE] + Embedded --> E1[LASSO Regression, Random Forest Importance] + +``` + +### A. Filter Methods (Statistical) + +These methods act as a "filter" before the training begins. They look at the intrinsic properties of the features (like their relationship with the target) using statistical tests. + +* **Correlation Coefficient:** Used to find linear relationships between features and targets. +* **Chi-Square:** Used for categorical features. +* **Mutual Information:** Measures how much information the presence of a feature contributes to the target. + +### B. Wrapper Methods (Iterative) + +These methods treat the selection process as a search problem. They train a model on different subsets of features and "wrap" the selection around the model's performance. + +* **Forward Selection:** Start with 0 features and add them one by one. +* **Recursive Feature Elimination (RFE):** Start with all features and prune the least important ones iteratively. + +![The process of Recursive Feature Elimination](/img/tutorials/ml/recursive-feature-elimination.jpg) + +### C. Embedded Methods (Integrated) + +These algorithms have feature selection built directly into their training process. + +* **LASSO (L1 Regularization):** Adds a penalty to the model that forces the coefficients of useless features to become exactly zero. +* **Tree-based Importance:** Algorithms like Random Forest or XGBoost naturally calculate which features were used most often to split the data. + +![Image showing Random Forest feature importance bar chart](/img/tutorials/ml/random-forest-feature-importance.jpg) + +## 3. 
Comparison Table + +| Method | Speed | Risk of Overfitting | Model Agnostic? | +| --- | --- | --- | --- | +| **Filter** | ⚡ Very Fast | Low | Yes | +| **Wrapper** | 🐢 Very Slow | **High** | Yes | +| **Embedded** | 🏎️ Fast | Moderate | No (Model-specific) | + +## 4. Implementation Example (RFE) + +Using Scikit-Learn to perform Recursive Feature Elimination with a Logistic Regression model: + +```python +from sklearn.feature_selection import RFE +from sklearn.linear_model import LogisticRegression + +# Assumes X is a pandas DataFrame of features and y holds the class labels + +# Define the model +model = LogisticRegression() + +# Select top 5 features +rfe = RFE(estimator=model, n_features_to_select=5) +fit = rfe.fit(X, y) + +# Print results +print(f"Num Features: {fit.n_features_}") +print(f"Selected Features: {X.columns[fit.support_]}") + +``` + +## 5. Identifying Multicollinearity + +One of the most important parts of feature selection is identifying features that are highly correlated with *each other* (not just the target). If `Feature A` and `Feature B` are 99% identical, you should drop one. + +We often use the **Variance Inflation Factor (VIF)** to detect this. A VIF score above a chosen threshold (commonly cited rules of thumb are $5$ or $10$) usually indicates that a feature is redundant. + +## References for More Details + +* **[Scikit-Learn Feature Selection Module](https://scikit-learn.org/stable/modules/feature_selection.html):** Exploring `SelectKBest` and `VarianceThreshold`. +* **[Feature Selection Strategies (Article)](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/):** Deciding which statistical test to use based on your data type. + +--- + +**Congratulations!** You have learned the entire Preprocessing Pipeline. From handling missing data to selecting the perfect features, your data is now "Model-Ready." 
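
The multicollinearity check from Section 5 can be sketched with NumPy alone. This is an illustrative implementation under stated assumptions, not a library API: the function name `variance_inflation_factors`, the synthetic columns `a`, `b`, `c`, and the cut-off of 10 are all choices made for the example. It uses the standard definition $VIF_j = 1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the other features.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """Compute VIF_j = 1 / (1 - R_j^2) for every column of X.

    R_j^2 is obtained by regressing column j on all other columns
    (plus an intercept) with ordinary least squares.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return vifs


# Synthetic data: b is almost a copy of a, c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.05, size=200)
c = rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([a, b, c]))
for name, vif in zip(["a", "b", "c"], vifs):
    flag = "redundant?" if vif > 10 else "ok"  # 10 is a rule of thumb, not a hard limit
    print(f"{name}: VIF = {vif:.1f} ({flag})")
```

Here `a` and `b` both receive a very large VIF because each is almost perfectly predicted by the other, while `c` stays close to 1. In practice you would drop one of the flagged features and recompute, since removing one column changes the VIF of the rest.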
\ No newline at end of file diff --git a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/handling-missing-data.mdx b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/handling-missing-data.mdx index e69de29..f74650f 100644 --- a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/handling-missing-data.mdx +++ b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/handling-missing-data.mdx @@ -0,0 +1,130 @@ +--- +title: Handling Missing Data +sidebar_label: Missing Data +description: "Techniques for identifying, analyzing, and resolving missing values in datasets using deletion and imputation strategies." +tags: [data-cleaning, preprocessing, pandas, imputation, statistics] +--- + +Data in the real world is rarely complete. Whether it's a sensor that went offline, a survey respondent who skipped a question, or a database merge that failed, you will encounter missing values. How you handle them can drastically change your model's performance. + +## 1. Why is Data Missing? + +Before fixing the data, you must understand the "mechanism" of the missingness. Statistics classifies this into three categories: + +1. **MCAR (Missing Completely at Random):** The missingness has no relationship with any data. (e.g., A random equipment malfunction). +2. **MAR (Missing at Random):** The missingness is related to other *observed* data. (e.g., Men are less likely to disclose their weight, but we know their gender). +3. **MNAR (Missing Not at Random):** The missingness is related to the missing value itself. (e.g., People with very high debt are less likely to report their debt levels). + +## 2. Detecting Missing Values + +Using Pandas, the first step is always to quantify the problem. 
+ +```python +import pandas as pd +import seaborn as sns + +# Load data +df = pd.read_csv('data.csv') + +# Count missing values per column +print(df.isnull().sum()) + +# Visualize missingness (highly recommended for large datasets) +sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis') + +``` + +## 3. Strategy 1: Deletion + +The simplest approach, but often the most dangerous. + +### A. Listwise Deletion (Drop Rows) + +Remove any row that has at least one missing value. + +* **When to use:** When the dataset is massive and missing values are rare. +* **Risk:** You might throw away valuable information or introduce bias if the data isn't MCAR. + +### B. Dropping Columns + +Remove an entire feature if it has too many missing values. + +* **When to use:** When the feature is not critical for the target prediction. + +## 4. Strategy 2: Imputation + +Imputation is the process of "filling in" the holes with estimated values. + +### A. Statistical Imputation + +Replacing missing values with a central tendency measure. + +* **Mean:** Good for normally distributed numerical data. +* **Median:** Better for skewed data (robust to outliers). +* **Mode:** Used for categorical data. + +### B. Constant Value Imputation + +Filling with a specific value like `0`, `"Unknown"`, or `-999`. + +* **Note:** This tells the model that the data was missing, which itself can be a feature. + +### C. Advanced: Predictive Imputation + +Using other features to predict the missing value. + +* **K-Nearest Neighbors (KNN):** Finds the "most similar" rows and averages their values. +* **MICE (Multivariate Imputation by Chained Equations):** A sophisticated iterative process that models each feature as a function of others. 
+ +```python +from sklearn.impute import SimpleImputer, KNNImputer + +# Simple Mean Imputation (fit_transform returns a 2D array, so flatten it) +imputer = SimpleImputer(strategy='mean') +df['Age'] = imputer.fit_transform(df[['Age']]).ravel() + +# KNN Imputation (works on numeric columns only) +knn_imputer = KNNImputer(n_neighbors=5) +df_filled = knn_imputer.fit_transform(df.select_dtypes(include='number')) + +``` + +## 5. Summary Table: Which Strategy to Pick? + +| Scenario | Best Action | +| --- | --- | +| **Randomly missing, very few rows** | Drop Rows (`dropna`) | +| **Numerical, Normal distribution** | Mean Imputation | +| **Numerical, Many outliers** | Median Imputation | +| **Categorical data** | Mode or "Unknown" category | +| **Critical feature, complex relationships** | KNN or MICE Imputation | + +## 6. The "Missingness Indicator" Trick + +Sometimes, the fact that data is missing is a signal in itself. You can create a new binary column: + +* `is_age_missing = 1` if Age is null, `0` otherwise. +This allows the model to learn if "not reporting age" correlates with the target. + +```mermaid +graph LR + Raw[Raw Feature] --> Split{Missing?} + Split -->|Yes| Flag[Add Binary Flag: 1] + Split -->|No| Keep[Value: X, Flag: 0] + Flag --> Impute[Fill Value with Median] + Impute --> Combined[Final Dataset] + Keep --> Combined + style Flag fill:#f3e5f5,stroke:#7b1fa2,color:#333 + +``` + +## References for More Details + +* **[Scikit-Learn Imputation Guide](https://scikit-learn.org/stable/modules/impute.html):** Technical implementation of KNN and Iterative Imputers. + + +* **[Feature Engineering for ML (Book)](https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/):** Deep diving into why missing data affects model coefficients. + +--- + +Fixing missing data is the first step in cleaning. Next, we need to ensure that the numbers themselves are on a scale that our algorithms can actually understand. 
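
The "Missingness Indicator" flow from Section 6 (flag first, then fill with the median) can be sketched with pandas as follows. The tiny `Age` table is made-up data for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up example data: two of the five ages are missing.
df = pd.DataFrame({"Age": [25.0, np.nan, 40.0, 31.0, np.nan]})

# 1. Record the flag BEFORE imputing; afterwards the information is gone.
df["is_age_missing"] = df["Age"].isna().astype(int)

# 2. Fill the holes with the median of the observed values.
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

print(df)
```

The order matters: once the median is filled in, an imputed row is indistinguishable from a person who really is that age. With a train/test split, the median should be computed on the training set only and reused for the test set. Scikit-learn's `SimpleImputer` can produce the same flag columns through its `add_indicator=True` option.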
\ No newline at end of file diff --git a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/normalization.mdx b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/normalization.mdx index e69de29..de5e24a 100644 --- a/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/normalization.mdx +++ b/docs/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/normalization.mdx @@ -0,0 +1,115 @@ +--- +title: Normalization Techniques +sidebar_label: Normalization +description: "A deep dive into Min-Max scaling, MaxAbs scaling, and Unit Vector normalization for bounded data ranges." +tags: [data-cleaning, preprocessing, normalization, min-max-scaling, machine-learning] +--- + +In Machine Learning, **Normalization** is the process of rescaling numeric variables to a strictly defined range, most commonly $[0, 1]$ or $[-1, 1]$. Unlike standardization, which is about centered distributions, normalization is about **boundaries**. + +## 1. When is Normalization Essential? + +Normalization is preferred over standardization in specific scenarios: + +* **Image Processing:** Pixel intensities are naturally bounded between 0 and 255. Normalizing them to $[0, 1]$ is standard practice for Convolutional Neural Networks (CNNs). +* **Neural Networks:** Activation functions like *Sigmoid* or *Tanh* are most sensitive in small ranges around zero. +* **Algorithms with No Distribution Assumption:** When you don't know if your data is Gaussian (Normal), normalization is a safer, non-parametric starting point. + +## 2. Min-Max Scaling + +This is the most common form of normalization. It shifts and rescales the data so that the minimum value becomes 0 and the maximum value becomes 1. + +**The Formula:** + +$$ +x' = \frac{x - \text{min}(x)}{\text{max}(x) - \text{min}(x)} +$$ + +* **Pros:** Preserves the relative distances between values. +* **Cons:** Extremely sensitive to **outliers**. 
If you have one value at 10,000 and the rest near 10, the "normal" data will be squashed into a tiny range near 0 (with differences on the order of $0.0001$). + +## 3. MaxAbs Scaling + +MaxAbs scaling divides each value by the maximum absolute value in the feature. This scales the data to the range **$[-1, 1]$**. + +**The Formula:** + +$$ +x' = \frac{x}{\text{max}(|x|)} +$$ + +* **Best Use Case:** Sparse data (data with many zeros). It does not "shift" the data (it doesn't subtract the mean or min), so it **preserves sparsity**. +* **Common in:** Text analytics and TF-IDF vectors. + +## 4. Robust Normalization (Quantile Scaling) + +If your data has significant outliers, Min-Max scaling will compress most of the values into a narrow band. A "Robust" approach uses the Interquartile Range (IQR) instead, so extreme values have little influence on the scale. The variant below maps $Q_1$ to 0 and $Q_3$ to 1; scikit-learn's `RobustScaler` centers on the median instead. + +**The Formula:** + +$$ +x' = \frac{x - Q_1(x)}{Q_3(x) - Q_1(x)} +$$ + +## 5. Comparison: Normalization vs. Standardization + +| Feature | Normalization (Min-Max) | Standardization (Z-Score) | +| :--- | :--- | :--- | +| **Range** | Fixed $[0, 1]$ or $[-1, 1]$ | Not bounded (mostly within $[-3, 3]$ for roughly Gaussian data) | +| **Mean/Sigma** | Varies | Mean = 0, Std Dev = 1 | +| **Outliers** | Highly Affected | Less Affected | +| **Best For** | Neural Networks, Images | Linear Reg, SVM, PCA | + +## 6. Practical Implementation + +Using `scikit-learn`, we can apply these transformations efficiently. + +```python +from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler + +# Sample Data: Age and Salary +data = [[25, 50000], [30, 80000], [45, 120000]] + +# Min-Max Scaling to [0, 1] +min_max = MinMaxScaler() +normalized_data = min_max.fit_transform(data) + +# MaxAbs Scaling (Preserves Zeros) +max_abs = MaxAbsScaler() +sparse_friendly_data = max_abs.fit_transform(data) + +``` + +## 7. 
Mathematical Visualisation + +```mermaid +graph LR + subgraph Raw [Raw Data] + D1[0...10...100] + end + + subgraph Norm [Normalized] + N1[0...0.1...1.0] + end + + subgraph Std [Standardized] + S1[-1.5...0...+1.5] + end + + Raw -->|Min-Max| Norm + Raw -->|Z-Score| Std + + style Norm fill:#e1f5fe,stroke:#01579b,color:#333 + style Std fill:#f3e5f5,stroke:#7b1fa2,color:#333 + +``` + +## References for More Details + +* **[Scikit-Learn Normalization Guide](https://scikit-learn.org/stable/modules/preprocessing.html#normalization):** Understanding `Normalizer` vs `MinMaxScaler`. + +* **[Google Machine Learning Crash Course](https://developers.google.com/machine-learning/data-prep/transform/normalization):** Visualizing how normalization helps loss functions converge. + +--- + +**Normalization handles the scale of your numbers, but what if you have too many features? Excess features can confuse a model and lead to "The Curse of Dimensionality."** diff --git a/docs/machine-learning/machine-learning-core/introduction-to-ml.mdx b/docs/machine-learning/machine-learning-core/introduction-to-ml.mdx index e69de29..0be84bc 100644 --- a/docs/machine-learning/machine-learning-core/introduction-to-ml.mdx +++ b/docs/machine-learning/machine-learning-core/introduction-to-ml.mdx @@ -0,0 +1,106 @@ +--- +title: "What is Machine Learning?" +sidebar_label: Introduction +description: "Understanding the paradigm shift from traditional programming to data-driven learning." +tags: [machine-learning, introduction, ai-basics, fundamentals] +--- + +At its simplest, **Machine Learning (ML)** is the field of study that gives computers the ability to learn without being explicitly programmed. Instead of a human writing a thousand "if-then" statements, we provide an algorithm with data, and the algorithm "finds" the patterns itself. + +## 1. The Paradigm Shift + +To understand ML, we must compare it to **Traditional Programming**. 
+ +### Traditional Programming +In traditional software engineering, a human provides the **Rules** (code) and the **Data**. The computer follows the rules to produce an **Output**. + +### Machine Learning +In ML, we provide the **Data** and the **Output** (labels). The computer analyzes these to produce the **Rules** (the Model). + +```mermaid +graph TD + subgraph Traditional ["Traditional Programming"] + Data1[Data] --> Logic[Rules/Code] + Logic --> Out1[Output] + end + + subgraph ML ["Machine Learning"] + Data2[Data] --> Answer[Expected Output] + Answer --> Learn[Learning Algorithm] + Learn --> Model[Rules/The Model] + end + + style Logic fill:#f5f5f5,stroke:#333,color:#333 + style Model fill:#e1f5fe,stroke:#01579b,color:#333 + +``` + +## 2. The Three Main Types of Learning + +Machine Learning is generally divided into three main categories based on how the agent "learns." + +### A. Supervised Learning + +The model is trained on **labeled data**. You give it inputs and the correct answers. It’s like a student learning with a teacher who corrects their homework. + +* **Regression:** Predicting a continuous number (e.g., Home prices). +* **Classification:** Predicting a category (e.g., Is this email Spam or Not Spam?). + +### B. Unsupervised Learning + +The model is given **unlabeled data** and must find hidden structures or patterns on its own. There is no "teacher." + +* **Clustering:** Grouping customers by similar buying habits. +* **Association:** Finding that people who buy bread also tend to buy butter. + +### C. Reinforcement Learning (RL) + +The model (agent) learns by interacting with an environment. It receives **rewards** for good actions and **penalties** for bad ones. It’s how AI learns to play chess or drive autonomous cars. + +## 3. The Core Ingredients of ML + +Every Machine Learning problem requires three components: + +1. **The Dataset:** High-quality, representative data. +2. 
**The Features:** The specific attributes or variables the model looks at (e.g., mileage, year, and brand for a car). +3. **The Algorithm:** The mathematical process used to find patterns (e.g., Linear Regression, Neural Networks). + +## 4. The Lifecycle of an ML Project + +Building a model isn't just writing code; it's a circular process: + +1. **Define the Goal:** What are we trying to predict? +2. **Data Collection:** Gathering raw information. +3. **Data Preprocessing:** Cleaning and scaling (what you learned in the [Data Engineering module](/tutorial/category/data-engineering-basics)). +4. **Model Training:** Feeding data to the algorithm. +5. **Evaluation:** Testing the model on data it hasn't seen before. +6. **Deployment:** Putting the model into a real-world app. + +```mermaid +stateDiagram-v2 + [*] --> Collection + Collection --> Preprocessing + Preprocessing --> Training + Training --> Evaluation + Evaluation --> Deployment + Deployment --> Collection : Feedback Loop + +``` + +## 5. When NOT to use Machine Learning + +ML is powerful, but it isn't always the right tool. Avoid ML if: + +* You have very little data. +* The problem can be solved with simple, static logic. +* You need 100% mathematical certainty (ML is probabilistic, not deterministic). + +## References for More Details + +* **[Elements of AI (Free Course)](https://www.elementsofai.com/):** A non-technical conceptual deep dive. + +* **[Google Machine Learning Glossary](https://developers.google.com/machine-learning/glossary):** Quickly looking up confusing terminology. 
+ +--- + +**Now that you understand the "Big Picture," let's look at the most fundamental math behind almost every predictive model.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/types-of-machine-learning/supervised-learning.mdx b/docs/machine-learning/machine-learning-core/types-of-machine-learning/supervised-learning.mdx index e69de29..1990938 100644 --- a/docs/machine-learning/machine-learning-core/types-of-machine-learning/supervised-learning.mdx +++ b/docs/machine-learning/machine-learning-core/types-of-machine-learning/supervised-learning.mdx @@ -0,0 +1,93 @@ +--- +title: "Supervised Learning: Learning with Labels" +sidebar_label: Supervised Learning +description: "A deep dive into supervised learning: regression, classification, and the relationship between features and targets." +tags: [machine-learning, supervised-learning, regression, classification, fundamentals] +--- + +**Supervised Learning** is the most widely used branch of Machine Learning. It is called "supervised" because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. + +In this paradigm, we provide the computer with **Inputs** (features) and the correct **Answers** (labels). The goal is for the model to learn a general rule that maps inputs to outputs. + +## 1. The Mathematical Core + +At its heart, supervised learning is about finding a function $f$ that maps input variables ($X$) to an output variable ($y$). + +$$ +y = f(X) + \epsilon +$$ + +* **$y$**: The Target (the value we want to predict). +* **$X$**: The Features (the data we use to make the prediction). +* **$f$**: The Model (the mapping function learned by the algorithm). +* **$\epsilon$**: Error/Noise (random variability that the model cannot predict). + +## 2. The Two Main Branches + +Supervised learning is divided based on the **nature of the target variable**. + +### A. 
Regression (Continuous Values) +Regression is used when the output variable is a **real or continuous value**. You are predicting a "how much" or "how many." +* **Example:** Predicting the price of a house based on its square footage. +* **Example:** Predicting the temperature for tomorrow. + +### B. Classification (Discrete Categories) +Classification is used when the output variable is a **category or label**. You are predicting "which one." +* **Binary Classification:** Only two possible classes (e.g., Spam or Not Spam). +* **Multi-class Classification:** More than two classes (e.g., Identifying if an image is a Cat, Dog, or Bird). + +## 3. The Supervised Learning Workflow + +The process of training a supervised model follows a strict sequence: + +```mermaid +graph LR + Data[Labeled Dataset] --> Split{Split} + Split --> Train[Training Set] + Split --> Test[Test Set] + + Train --> Algo[ML Algorithm] + Algo --> Model[Trained Model] + + Test --> Model + Model --> Eval[Performance Metrics] + + style Train fill:#e8f5e9,stroke:#2e7d32,color:#333 + style Test fill:#fff3e0,stroke:#ef6c00,color:#333 + +``` + +1. **Data Labeling:** Ensuring every row of data has a known "ground truth." +2. **Feature Selection:** Choosing which attributes are relevant. +3. **Training:** The algorithm looks at the training data and adjusts its internal parameters to minimize error. +4. **Prediction:** We feed the model new, unseen data ($X_{\text{new}}$) and it generates a prediction ($\hat{y}$). + +## 4. Common Supervised Algorithms + +Depending on the complexity of the data, we choose different "learners": + +| Algorithm | Type | Use Case | +| --- | --- | --- | +| **Linear Regression** | Regression | Predicting sales or house prices. | +| **Logistic Regression** | Classification | Predicting if a transaction is fraudulent (Yes/No). | +| **Decision Trees** | Both | Creating "if-this-then-that" logic for credit scoring. 
| +| **Support Vector Machines (SVM)** | Both | High-dimensional classification like facial recognition. | +| **Neural Networks** | Both | Complex tasks like image and speech recognition. | + +## 5. Challenges: Overfitting and Underfitting + +The biggest hurdle in supervised learning is ensuring the model generalizes well to **new** data. + +* **Underfitting:** The model is too simple to capture the underlying pattern (High Bias). +* **Overfitting:** The model learns the noise in the training data too well and fails to predict new data (High Variance). + +![Comparing underfitting, balanced fit, and overfitting on a dataset](/img/tutorials/ml/comparing-underfitting-balanced-fit-and-overfitting.jpg) + +## References for More Details + +* **[Scikit-Learn Supervised Learning Guide](https://scikit-learn.org/stable/supervised_learning.html):** Code examples for every major algorithm. +* **[Machine Learning Mastery - Supervised Learning](https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/):** A simple, beginner-friendly explanation of the differences. 
+ +--- + +**Now that you understand how models learn from labeled data, let's explore the opposite: finding hidden patterns in data where the "answers" aren't provided.** \ No newline at end of file diff --git a/static/img/tutorials/ml/comparing-underfitting-balanced-fit-and-overfitting.jpg b/static/img/tutorials/ml/comparing-underfitting-balanced-fit-and-overfitting.jpg new file mode 100644 index 0000000..6c5a63d Binary files /dev/null and b/static/img/tutorials/ml/comparing-underfitting-balanced-fit-and-overfitting.jpg differ diff --git a/static/img/tutorials/ml/random-forest-feature-importance.jpg b/static/img/tutorials/ml/random-forest-feature-importance.jpg new file mode 100644 index 0000000..cbc87e7 Binary files /dev/null and b/static/img/tutorials/ml/random-forest-feature-importance.jpg differ diff --git a/static/img/tutorials/ml/recursive-feature-elimination.jpg b/static/img/tutorials/ml/recursive-feature-elimination.jpg new file mode 100644 index 0000000..62e4d9e Binary files /dev/null and b/static/img/tutorials/ml/recursive-feature-elimination.jpg differ