The API currently uses Model 3rd. It was selected as the production model because, of the three model iterations, it achieved the highest test accuracy (98.83% on the 20% hold-out of the 80/20 split) and the best results on both manual test sets.
- Model 1st: 3,000 URLs (1,500 benign + 1,500 malicious) - balanced dataset
- Model 2nd: 3,000 URLs (1,500 benign + 1,500 malicious) - new balanced dataset (retrained)
- Model 3rd: 6,000 URLs (3,000 benign + 3,000 malicious) - new balanced dataset (retrained)
- Source: Manual collection, plus several datasets cited in academic papers
- Features: 38 features (36 lexical + 2 host-based)
- Train/Test Split: 80/20 ratio (80% training, 20% testing)
- Source: Manual collection, plus datasets cited in academic papers
Note: These are additional manual test cases used to evaluate the final saved model performance beyond the standard 80/20 train/test split validation.
- Size: 77 URLs (all benign)
- Purpose: Benign model performance evaluation on completely unseen real-world data
- Size: 675 URLs
- 123 benign URLs
- 552 malicious URLs
- Purpose: Overall model performance evaluation on completely unseen real-world data
| Model Version | Test Accuracy (20% split) | Test Set 1 (77 Benign) | Test Set 2 (675 Mixed) |
|---|---|---|---|
| Model 1st | 98.00% | 66 benign, 11 misclassified | 657 correct (534 malicious + 123 benign) |
| Model 2nd | 98.67% | 58 benign, 19 misclassified | 662 correct (539 malicious + 123 benign) |
| Model 3rd ⭐ | 98.83% | 68 benign, 9 misclassified | 669 correct (546 malicious + 123 benign) |
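The percentages reported for each model can be re-derived from the raw counts in the table. The snippet below is a minimal sketch of that arithmetic; the dictionary keys and helper names are illustrative, not part of the project code. Because Test Set 1 contains only benign URLs, the share classified correctly is the benign recall.

```python
# Counts copied from the results table; names below are illustrative only.
TEST_SET_1_TOTAL = 77        # all benign
TEST_SET_2_BENIGN = 123
TEST_SET_2_MALICIOUS = 552

results = {
    "Model 1st": {"ts1_benign": 66, "ts2_malicious": 534, "ts2_benign": 123},
    "Model 2nd": {"ts1_benign": 58, "ts2_malicious": 539, "ts2_benign": 123},
    "Model 3rd": {"ts1_benign": 68, "ts2_malicious": 546, "ts2_benign": 123},
}


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


def summarize(counts):
    """Derive per-model metrics from correctly classified counts."""
    ts2_total = TEST_SET_2_BENIGN + TEST_SET_2_MALICIOUS
    ts2_correct = counts["ts2_malicious"] + counts["ts2_benign"]
    return {
        "ts1_benign_recall": pct(counts["ts1_benign"], TEST_SET_1_TOTAL),
        "ts2_malicious_recall": pct(counts["ts2_malicious"], TEST_SET_2_MALICIOUS),
        "ts2_accuracy": pct(ts2_correct, ts2_total),
    }


summary = {name: summarize(counts) for name, counts in results.items()}
for name, metrics in summary.items():
    print(name, metrics)
```

The overall Test Set 2 accuracy is not stated in the table itself; it follows directly from the "correct" totals divided by 675.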
- Test Accuracy (20% split): 98.00%
- Test Set 1 Performance: 66/77 benign correctly classified (85.7% benign recall)
- Test Set 2 Performance: 534/552 malicious detected (96.7% malicious recall)
- Test Accuracy (20% split): 98.67%
- Test Set 1 Performance: 58/77 benign correctly classified (75.3% benign recall)
- Test Set 2 Performance: 539/552 malicious detected (97.6% malicious recall)
- Test Accuracy (20% split): 98.83%
- Test Set 1 Performance: 68/77 benign correctly classified (88.3% benign recall)
- Test Set 2 Performance: 546/552 malicious detected (98.9% malicious recall)
- High Test Accuracy: 98.83% on the 20% hold-out split
- Excellent Malicious Detection: 98.9% recall on Test Set 2 (546/552)
- Balanced Performance: 68/77 benign URLs correct on Test Set 1; 123/123 benign and 546/552 malicious correct on Test Set 2
- Progressive Improvement: Each model iteration improved test accuracy (98.00% → 98.67% → 98.83%)
- Algorithm: XGBoost Classifier
- Feature Set: 38 carefully engineered features
- Training Approach: Balanced datasets (50% benign, 50% malicious)
- Final Training Data: 6,000 URLs (3,000 benign + 3,000 malicious)
- File Format: Serialized as .json using joblib
The model analyzes 38 features extracted from URLs:
- URL structure analysis
- Character pattern recognition
- String composition metrics
- Domain and path characteristics
- Domain registration information
- Domain expiration data
For detailed feature documentation, see FEATURES.md.
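To give a feel for what lexical features look like in practice, here is a minimal sketch using only the Python standard library. The feature names and the selection are hypothetical examples, not the 38 features documented in FEATURES.md, and the host-based features (domain registration and expiration) are omitted because they require an external WHOIS lookup.

```python
from urllib.parse import urlsplit


def extract_lexical_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    These names are assumptions for demonstration; the project's real
    feature set is defined in FEATURES.md.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parts.path),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count_in_host": host.count("."),
        "hyphen_count": url.count("-"),
        "has_at_symbol": "@" in url,
        "uses_https": parts.scheme == "https",
    }


features = extract_lexical_features("https://example.com/login/index.html?id=42")
print(features)
```

Each URL is reduced to a fixed-length numeric or boolean vector, which is the form a tree-based classifier such as XGBoost expects.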
Feature correlation matrix showing the pairwise correlations among the 39 candidate features. It informed the decision to exclude TLD features, which were highly correlated with other lexical features, leaving the final 38 features used in the model. Highly correlated features carry largely the same information, which makes it harder to determine which features are truly important. Removing them reduces model complexity and can improve performance by eliminating redundant information.
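One common way to act on a correlation matrix is to drop any feature whose absolute Pearson correlation with an already-kept feature exceeds a cutoff. The sketch below illustrates that idea on toy data; the 0.9 threshold, the greedy keep-first rule, and the column names are assumptions for illustration, not the procedure actually used for this model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Greedily keep features that are not highly correlated with kept ones."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data: path_length is url_length minus a constant, so it is redundant.
columns = {
    "url_length": [20, 35, 50, 65, 80],
    "path_length": [5, 20, 35, 50, 65],
    "digit_count": [3, 0, 4, 1, 2],
}
kept = drop_correlated(columns)
print(kept)
```

With a greedy rule like this, the order of the columns decides which member of a correlated pair survives, so in practice the more interpretable feature is usually listed first.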
Top features contributing to Model 3rd classification decisions, showing the most influential URL characteristics.
Confusion matrix for Model 3rd showing classification performance on test data.
Normalized confusion matrix showing percentage-based performance metrics.
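A normalized confusion matrix divides each row of raw counts by that row's total, so every cell reads as the share of an actual class. The sketch below shows the calculation. Since the counts behind the figures are not listed in the text, it uses Model 3rd's Test Set 2 counts from the results table as stand-in data, and the function name is illustrative.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so that every row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Rows: actual benign, actual malicious.
# Columns: predicted benign, predicted malicious.
# Stand-in counts: Model 3rd on Test Set 2 (123 benign, 552 malicious).
confusion = [
    [123, 0],
    [6, 546],
]
normalized = normalize_rows(confusion)
for row in normalized:
    print([f"{value:.1%}" for value in row])
```

The diagonal of the normalized matrix gives the per-class recall, which is why the malicious row matches the 98.9% recall reported for Model 3rd.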
The model underwent a systematic retraining process with progressively larger balanced datasets:
- 1st Model: Initial training with 3,000 URLs (1,500 benign + 1,500 malicious)
  - Performance: 98.00% accuracy
  - Dataset: Balanced 50/50 split
- 2nd Model: Retrained with 3,000 new URLs (1,500 benign + 1,500 malicious)
  - Performance: 98.67% accuracy (+0.67 percentage points)
  - Dataset: Fresh balanced 50/50 split
  - Approach: Same size, different data for robustness
- 3rd Model: Expanded training with 6,000 URLs (3,000 benign + 3,000 malicious)
  - Performance: 98.83% accuracy (+0.16 percentage points)
  - Dataset: Doubled dataset size, maintained 50/50 balance
  - Approach: Larger dataset for better generalization
- Balanced Datasets: All models trained with equal benign/malicious samples
- Progressive Scaling: 3k → 3k → 6k URL progression
- Consistent Improvement: Each iteration achieved higher accuracy on the 20% test split
- Data Variety: Model 2nd used fresh data, Model 3rd used larger fresh dataset
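The combination of a balanced 50/50 dataset and an 80/20 train/test split can be sketched with the standard library alone. This is an illustration under assumptions: the pool contents, the function name, the fixed seed, and the per-class (stratified) split are all hypothetical, since the original sampling code is not shown here.

```python
import random


def balanced_split(benign, malicious, per_class, test_ratio=0.2, seed=42):
    """Sample an equal number of URLs per class, then split each class 80/20.

    Labels: 0 = benign, 1 = malicious. Splitting per class keeps the
    50/50 balance in both the training and the test portion.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label, urls in ((0, benign), (1, malicious)):
        sample = rng.sample(urls, per_class)
        n_test = int(round(per_class * test_ratio))
        test += [(url, label) for url in sample[:n_test]]
        train += [(url, label) for url in sample[n_test:]]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Placeholder pools standing in for the collected URL lists.
benign_pool = [f"https://benign.example/{i}" for i in range(4000)]
malicious_pool = [f"http://malicious.example/{i}" for i in range(4000)]

# Model 3rd scale: 3,000 URLs per class, 6,000 in total.
train, test = balanced_split(benign_pool, malicious_pool, per_class=3000)
print(len(train), len(test))
```

At the Model 3rd scale this yields 4,800 training and 1,200 test URLs, each half benign and half malicious.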
Last Updated: August 2025

Model Version: 3rd (Current Production Model)