AkashMs24/Explainable-Job-Scam-Risk-Detection-System-

🛡️ SCAMGUARD AI

Explainable Job Scam Risk Detection using NLP & Machine Learning


The Problem

Every year, thousands of freshers apply to fake job postings and lose money, time, and sometimes personal data.

Most detection tools give a binary answer: real or fake. That is not enough. A fresher needs to know how risky a posting is and exactly why.


What ScamGuard-AI Does

Paste any job posting → get an explainable fraud risk score in seconds.

  • Risk score 0–100 with LOW / MEDIUM / HIGH classification
  • Exact SHAP attribution: shows which words/signals drove the prediction
  • Behavioral signals: urgency language, free email, missing salary, scam phrases
  • Rule-based flags: transparent, human-readable explanation alongside ML output

Architecture

Raw Job Posting (title + description + company + salary)
        │
        ▼
┌─────────────────────────────────────────────┐
│              utils.py  (runtime brain)      │
│                                             │
│  build_feature_vector()                     │
│    ├── TF-IDF transform     → 5000 dims     │
│    └── Behavioral features  →    3 dims     │
│              total          → 5003 dims     │
│                                             │
│  fraud_model.predict_proba()  → P(fraud)    │
│  compute_risk_score()         → 0–100       │
│  compute_shap_values()        → φᵢ exact    │
└─────────────────────────────────────────────┘
        │
        ▼
   app.py (Streamlit UI)

File Structure

├── app.py                               # Streamlit UI, deployed on Streamlit Cloud
├── utils.py                             # Runtime brain: all ML logic
├── 02_feature_engineering_and_model.py  # Training script, run once offline
├── eda.py                               # EDA: origin of behavioral signals
├── expainabiity_and_insights.py         # Coef-based SHAP formula discovery
├── fraud_model.pkl                      # Trained Logistic Regression model
├── tfidf_vectorizer.pkl                 # Fitted TF-IDF vectorizer (5000 features)
├── feature_names.pkl                    # Feature name list (5003 names)
└── requirements.txt                     # Dependencies

Training pipeline:

eda.py  →  expainabiity_and_insights.py  →  02_feature_engineering_and_model.py
  ↓                    ↓                              ↓
signals            SHAP formula               .pkl artifacts
              (all copied into utils.py)

Model Benchmarking

Four algorithms were trained on the same 5003-dim feature space. Logistic Regression was selected because it yields mathematically exact SHAP values; the test AUC difference vs XGBoost is only 0.005, making interpretability the decisive factor.

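| Model               | Test AUC | F1 (Fraud) | CV AUC (5-fold) | Selected |
|---------------------|----------|------------|-----------------|----------|
| Logistic Regression | 0.9800   | 0.88       | 0.96 ± 0.01     | ✅ Yes   |
| XGBoost             | 0.9750   | 0.86       | 0.95 ± 0.01     | No       |
| Gradient Boosting   | 0.9710   | 0.85       | 0.94 ± 0.01     | No       |
| Random Forest       | 0.9680   | 0.84       | 0.94 ± 0.02     | No       |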

Why LR over XGBoost? LR gives exact SHAP directly from the model coefficients via φᵢ = coef[i] × feature_value[i], so no separate explainer such as TreeSHAP is needed. The AUC gap is negligible; interpretability is not.

Is the model overfit? The 5-fold stratified CV AUC is 0.96 ± 0.01, close to the test AUC of 0.98, which suggests the model generalizes rather than memorizing the training split. class_weight='balanced' compensates for the ~5% fraud rate.


Feature Engineering

Text Features (5000 dims)

TfidfVectorizer(max_features=5000, stop_words='english')
combined_text = title + description + requirements
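
A minimal sketch of how the text and behavioral columns could be joined into one matrix, assuming scikit-learn and SciPy are installed. The two toy postings, the space-joined fields, and the placeholder behavioral values are illustrative assumptions; the project's real vectorizer is fitted on the training set and loaded from tfidf_vectorizer.pkl.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy postings (made up for illustration only).
postings = [
    {"title": "Data Analyst", "description": "Analyze sales data with SQL.",
     "requirements": "Two years of experience."},
    {"title": "Work from home", "description": "Earn money fast, apply now!",
     "requirements": "No experience needed."},
]

# Assumption: fields are joined with spaces so boundary words do not fuse.
combined_text = [
    " ".join((p["title"], p["description"], p["requirements"])) for p in postings
]

vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
text_features = vectorizer.fit_transform(combined_text)  # sparse, <= 5000 columns

# Placeholder behavioral columns: desc_length, urgency_score, free_email.
behavioral = csr_matrix(np.array([[28.0, 0.0, 0.0], [27.0, 1.0, 1.0]]))

# Column-wise concatenation: text columns first, then the 3 behavioral ones.
feature_matrix = hstack([text_features, behavioral]).tocsr()
print(feature_matrix.shape)
```

On the full training corpus the text block would reach the 5000-column cap, giving the 5003 columns described in the architecture.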

Behavioral Features (3 dims)

| Feature       | Origin | Signal                           |
|---------------|--------|----------------------------------|
| desc_length   | eda.py | Short descriptions = higher risk |
| urgency_score | eda.py | Count of urgency words           |
| free_email    | eda.py | Gmail/Yahoo in company contact   |
Risk Score Formula (§10)

risk_score = (0.60 × adj_prob
            + 0.15 × urgency_norm
            + 0.15 × salary_missing
            + 0.10 × free_email) × 100
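
As a worked example, the formula can be written as a small function. The sketch assumes every input is already scaled to [0, 1] (adj_prob as the model's adjusted fraud probability, urgency_norm as a normalized urgency count, and the other two as 0/1 flags); the README does not define these scalings, so treat them as assumptions.

```python
def compute_risk_score(adj_prob, urgency_norm, salary_missing, free_email):
    """Composite score on a 0-100 scale; all inputs are assumed to lie in [0, 1]."""
    weighted = (0.60 * adj_prob
                + 0.15 * urgency_norm
                + 0.15 * salary_missing
                + 0.10 * free_email)
    return round(weighted * 100, 1)


# 0.60*0.80 + 0.15*0.5 + 0.15*1 + 0.10*1 = 0.805
score = compute_risk_score(adj_prob=0.80, urgency_norm=0.5,
                           salary_missing=1, free_email=1)
print(score)  # 80.5
```

Because the weights sum to 1.0, the score stays within 0 to 100 as long as the inputs stay within [0, 1].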

SHAP Explainability

Exact SHAP for Logistic Regression, with no approximation and no external shap library:

φᵢ = coef[i] × feature_value[i]      # log-odds contribution
log_odds = intercept + Σ φᵢ           # reconstructs model output exactly
P(fraud) = sigmoid(log_odds)           # integrity check: ✓ exact match
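
The three lines above can be checked numerically with plain Python. The coefficients, intercept, and feature values below are toy numbers, not taken from fraud_model.pkl, and the contributions are measured against an all-zero input, which is what the coef × value form implies.

```python
import math


def compute_shap_values(coef, x):
    """phi_i = coef[i] * x[i]: per-feature contribution to the log-odds."""
    return [c * v for c, v in zip(coef, x)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy model and input (illustrative values only).
coef = [1.2, -0.4, 2.0]
intercept = -1.5
x = [0.5, 1.0, 0.25]

phi = compute_shap_values(coef, x)
log_odds = intercept + sum(phi)
p_fraud = sigmoid(log_odds)

# Integrity check: the model's own dot product must give the same probability.
direct = sigmoid(intercept + sum(c * v for c, v in zip(coef, x)))
assert math.isclose(p_fraud, direct)

# Rank features by the size of their contribution, largest first.
ranked = sorted(enumerate(phi), key=lambda pair: abs(pair[1]), reverse=True)
print(ranked[0])
```

Sorting by absolute contribution is one simple way to pick which words or signals to surface in the UI.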

Dataset

EMSCAD (Employment Scam Aegean Dataset)

  • ~18,000 job postings
  • ~5% fraudulent (imbalanced; handled via class_weight='balanced')
  • Features: title, description, company_profile, requirements, salary_range
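
To see what class_weight='balanced' does with this imbalance, the sketch below reproduces scikit-learn's documented weighting rule, n_samples / (n_classes × class count), in plain Python. The counts are rounded from the approximate figures above (18,000 postings, 5% fraud), so the exact weights in the trained model will differ slightly.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Rounded figures: 900 fraudulent (label 1) and 17,100 legitimate (label 0).
labels = [1] * 900 + [0] * 17_100
weights = balanced_class_weights(labels)
print(weights)
```

With these counts each fraudulent posting weighs 10.0 and each legitimate one about 0.53, so errors on the rare class cost roughly 19 times more during training.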

How to Run Locally

# 1. Clone
git clone https://github.com/AkashMs24/explainable-job-scam-risk-detection-system-.git
cd explainable-job-scam-risk-detection-system-

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run Streamlit app
streamlit run app.py

Tech Stack

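| Layer          | Technology                           |
|----------------|--------------------------------------|
| Language       | Python 3.9+                          |
| ML             | scikit-learn, XGBoost                |
| NLP            | TF-IDF (sklearn)                     |
| Explainability | Exact SHAP (custom, no shap library) |
| UI             | Streamlit                            |
| Serialization  | joblib                               |
| Data           | pandas, numpy, scipy                 |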

Key Design Decisions

1. Why Logistic Regression? It gives exact SHAP without approximation. The AUC gap vs XGBoost is 0.005, so interpretability wins.

2. Why exact SHAP instead of the shap library? For linear models, φᵢ = coef[i] × feature_value[i] is a mathematically exact decomposition of the log-odds, measured against an all-zero baseline input. It needs no extra dependency and no approximation, is faster, and is verifiable with an integrity check.

3. Why a composite risk score instead of raw probability? Raw probability ignores behavioral red flags (missing salary, free email, urgency). The composite score captures both ML signal and domain knowledge from EDA.

4. Why a 0.35 threshold instead of 0.5? It is optimized for recall: catching fraud matters more than avoiding false alarms when the cost of a missed scam is high (financial loss, data theft).
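
The recall trade-off behind the 0.35 threshold can be illustrated with a few made-up predictions. The probabilities and labels below are invented for the example and say nothing about the model's real performance.

```python
def recall_precision(probs, labels, threshold):
    """Recall and precision for the fraud class (label 1) at a given cutoff."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


# Made-up fraud probabilities and true labels, purely for illustration.
probs = [0.92, 0.61, 0.44, 0.38, 0.30, 0.12, 0.41, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

for t in (0.5, 0.35):
    r, p = recall_precision(probs, labels, t)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

In this toy set, lowering the cutoff from 0.5 to 0.35 catches the two scams scored 0.44 and 0.38 at the cost of one false alarm.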


Not a substitute for manual verification · Trained on EMSCAD dataset
