Every year, thousands of freshers apply to fake job postings, losing money, time, and sometimes personal data.
Most detection tools give a binary answer: real or fake. That is not enough. A fresher needs to know how risky a posting is and exactly why.
Paste any job posting and get an explainable fraud risk score in seconds.
- Risk score 0-100 with LOW / MEDIUM / HIGH classification
- Exact SHAP attribution: shows which words/signals drove the prediction
- Behavioral signals: urgency language, free email, missing salary, scam phrases
- Rule-based flags: transparent, human-readable explanation alongside ML output
Raw Job Posting (title + description + company + salary)
        │
        ▼
┌─────────────────────────────────────────────┐
│ utils.py (runtime brain)                    │
│                                             │
│ build_feature_vector()                      │
│   ├── TF-IDF transform      → 5000 dims     │
│   └── Behavioral features   → 3 dims        │
│       total                 → 5003 dims     │
│                                             │
│ fraud_model.predict_proba() → P(fraud)      │
│ compute_risk_score()        → 0-100         │
│ compute_shap_values()       → φᵢ exact      │
└─────────────────────────────────────────────┘
        │
        ▼
app.py (Streamlit UI)
├── app.py                                # Streamlit UI, deployed on Streamlit Cloud
├── utils.py                              # Runtime brain: all ML logic
├── 02_feature_engineering_and_model.py   # Training script, run once offline
├── eda.py                                # EDA: origin of behavioral signals
├── expainabiity_and_insights.py          # Coef-based SHAP formula discovery
├── fraud_model.pkl                       # Trained Logistic Regression model
├── tfidf_vectorizer.pkl                  # Fitted TF-IDF vectorizer (5000 features)
├── feature_names.pkl                     # Feature name list (5003 names)
└── requirements.txt                      # Dependencies
Training pipeline:
eda.py → expainabiity_and_insights.py → 02_feature_engineering_and_model.py
   ↓                  ↓                                  ↓
signals          SHAP formula                     .pkl artifacts
(all copied into utils.py)
Four algorithms were trained on the same 5003-dim feature space. Logistic Regression was selected because it allows mathematically exact SHAP. The test AUC difference vs XGBoost is only 0.005 (in LR's favor here), so interpretability was the decisive factor.
| Model | Test AUC | F1 (Fraud) | CV AUC (5-fold) | Selected |
|---|---|---|---|---|
| Logistic Regression | 0.9800 | 0.88 | 0.96 ± 0.01 | ✅ Yes |
| XGBoost | 0.9750 | 0.86 | 0.95 ± 0.01 | ❌ |
| Gradient Boosting | 0.9710 | 0.85 | 0.94 ± 0.01 | ❌ |
| Random Forest | 0.9680 | 0.84 | 0.94 ± 0.02 | ❌ |
Why LR over XGBoost?
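LR gives exact SHAP via φᵢ = coef[i] × feature_value[i] (contributions are measured against an all-zero baseline, so the intercept is the base value). No TreeSHAP or sampling-based explainer is needed. The AUC gap is negligible; interpretability is not.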
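The identity above can be checked in a few lines of plain Python. This is a minimal sketch, not the project's `compute_shap_values()`: the coefficients, intercept, and feature values are made-up toy numbers, and the helper name `linear_contributions` is hypothetical. It computes each φᵢ, rebuilds the log-odds from the intercept plus the contributions, and ranks features by the size of their contribution.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_contributions(coef, x):
    """Per-feature log-odds contributions: phi_i = coef[i] * x[i]."""
    return [c * v for c, v in zip(coef, x)]

# Toy values (assumptions for illustration, not taken from fraud_model.pkl).
coef = [1.2, -0.4, 2.0]   # learned weights
intercept = -3.0          # model bias term, acts as the base value
x = [0.5, 1.0, 1.0]       # one feature vector

phi = linear_contributions(coef, x)
log_odds = intercept + sum(phi)

# Integrity check: the reconstruction must match the model's own linear score.
direct = intercept + sum(c * v for c, v in zip(coef, x))
assert math.isclose(log_odds, direct)

p_fraud = sigmoid(log_odds)

# Rank feature indices by absolute contribution, largest first.
ranked = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)
```

Because the contributions add up in log-odds space, positive φᵢ values push toward fraud and negative ones push away from it; the probability is obtained only after the final sigmoid.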
Is the model overfitting?
5-fold stratified CV AUC = 0.96 ± 0.01, close to the 0.98 test AUC, which suggests the model is not overfitting and shows no obvious sign of data leakage. class_weight='balanced' handles the ~5% fraud rate.
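To see what `class_weight='balanced'` does at a ~5% fraud rate, the weights can be computed by hand. scikit-learn documents the rule as `n_samples / (n_classes * class_count)`. The sketch below applies it to a hypothetical set of 1,000 labels with 50 frauds; these are toy numbers chosen to match the ~5% rate, not the real dataset counts, and `balanced_class_weights` is an illustrative helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical label vector: 950 legitimate (0) and 50 fraudulent (1) postings.
labels = [0] * 950 + [1] * 50
weights = balanced_class_weights(labels)

# How much more one fraud example counts in the loss than one legitimate example.
ratio = weights[1] / weights[0]
```

With a 95/5 split the ratio works out to 950 / 50 = 19, so each fraudulent posting carries as much weight in training as nineteen legitimate ones.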
TfidfVectorizer(max_features=5000, stop_words='english')
combined_text = title + description + requirements

| Feature | Origin | Signal |
|---|---|---|
| desc_length | eda.py | Short descriptions = higher risk |
| urgency_score | eda.py | Count of urgency words |
| free_email | eda.py | Gmail/Yahoo in company contact |
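A rough sketch of how the three behavioral features in the table might be computed for one posting. Everything here is an assumption for illustration: the function name `behavioral_features`, the urgency word list, the free-email domain list, and the choice to measure description length in characters. The project's actual definitions live in `utils.py` and `eda.py`.

```python
import re

# Hypothetical vocabularies; not the lists used by the project.
URGENCY_WORDS = {"urgent", "immediately", "asap", "hurry", "limited"}
FREE_EMAIL_DOMAINS = ("gmail.com", "yahoo.com")

def behavioral_features(description: str, contact: str) -> dict:
    """Compute the three behavioral signals for one posting."""
    tokens = re.findall(r"[a-z]+", description.lower())
    contact_lower = contact.lower()
    return {
        # Length in characters (assumption; could equally be a word count).
        "desc_length": len(description),
        # Number of tokens that appear in the urgency vocabulary.
        "urgency_score": sum(1 for t in tokens if t in URGENCY_WORDS),
        # 1 if the contact address uses a free email provider, else 0.
        "free_email": int(
            any(contact_lower.endswith("@" + d) for d in FREE_EMAIL_DOMAINS)
        ),
    }

features = behavioral_features(
    "URGENT hiring! Apply immediately, limited seats.",
    "hr.recruiter@gmail.com",
)
```

In the pipeline described above, these three values would be appended to the 5000 TF-IDF dimensions to form the 5003-dim feature vector.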
risk_score = (0.60 × adj_prob
            + 0.15 × urgency_norm
            + 0.15 × salary_missing
            + 0.10 × free_email) × 100
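A worked example of the composite formula, as a sketch rather than the code in `utils.py`. It assumes all four inputs are already scaled to the range 0 to 1 (how `adj_prob` and `urgency_norm` are derived is not shown in this README). The LOW / MEDIUM / HIGH cutoffs in `risk_band` are placeholders invented for the example, since the actual cutoffs are not stated here.

```python
def compute_risk_score(adj_prob, urgency_norm, salary_missing, free_email):
    """Weighted blend of model probability and behavioral flags, scaled to 0-100."""
    score = (0.60 * adj_prob
             + 0.15 * urgency_norm
             + 0.15 * salary_missing
             + 0.10 * free_email) * 100
    return round(score, 1)

def risk_band(score, medium_at=40.0, high_at=70.0):
    """Map a score to LOW / MEDIUM / HIGH. Cutoffs are placeholders, not the app's."""
    if score >= high_at:
        return "HIGH"
    if score >= medium_at:
        return "MEDIUM"
    return "LOW"

# Posting with P(fraud)=0.80, moderate urgency, no salary listed, free email contact.
score = compute_risk_score(adj_prob=0.80, urgency_norm=0.5,
                           salary_missing=1, free_email=1)
band = risk_band(score)
```

For that input the four terms contribute 48 + 7.5 + 15 + 10 = 80.5 points. Because the weights sum to 1.0, the score is bounded between 0 and 100 whenever the inputs stay in the 0 to 1 range.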
Exact SHAP for Logistic Regression, with no approximation and no external shap library:
φᵢ = coef[i] × feature_value[i] # log-odds contribution
log_odds = intercept + Σ φᵢ # reconstructs model output exactly
P(fraud) = sigmoid(log_odds) # integrity check: exact match

EMSCAD: Employment Scam Aegean Dataset
- ~18,000 job postings
- ~5% fraudulent (imbalanced; handled via class_weight='balanced')
- Features: title, description, company_profile, requirements, salary_range
# 1. Clone
git clone https://github.com/AkashMs24/explainable-job-scam-risk-detection-system-.git
cd explainable-job-scam-risk-detection-system-
# 2. Install dependencies
pip install -r requirements.txt
# 3. Run Streamlit app
streamlit run app.py

| Layer | Technology |
|---|---|
| Language | Python 3.9+ |
| ML | scikit-learn, XGBoost |
| NLP | TF-IDF (sklearn) |
| Explainability | Exact SHAP (custom, no shap library) |
| UI | Streamlit |
| Serialization | joblib |
| Data | pandas, numpy, scipy |
1. Why Logistic Regression? Exact SHAP without approximation. The AUC gap vs XGBoost is 0.005, so interpretability wins.
2. Why exact SHAP instead of the shap library? For linear models, φᵢ = coef[i] × feature_value[i] is mathematically exact. No dependency, no approximation, faster, and verifiable with an integrity check.
3. Why a composite risk score instead of raw probability? Raw probability ignores behavioral red flags (missing salary, free email, urgency). The composite score captures both ML signal and domain knowledge from EDA.
4. Why 0.35 threshold instead of 0.5? Optimized for recall: catching fraud is more important than avoiding false alarms when the cost of a missed scam is high (financial loss, data theft).
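The recall trade-off behind the 0.35 threshold can be illustrated on a handful of made-up predictions. The probabilities and labels below are toy values, not model output, and `recall_and_false_alarms` is a hypothetical helper. The sketch compares fraud recall and the number of false alarms at the two thresholds.

```python
def recall_and_false_alarms(probs, labels, threshold):
    """Return (recall on fraud, number of false alarms) at a given threshold."""
    flagged = [p >= threshold for p in probs]
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    false_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    total_fraud = sum(labels)
    return true_pos / total_fraud, false_pos

# Toy predicted probabilities and ground truth (1 = fraud); invented for illustration.
probs = [0.92, 0.48, 0.38, 0.41, 0.10, 0.05, 0.62, 0.30]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

at_050 = recall_and_false_alarms(probs, labels, 0.50)
at_035 = recall_and_false_alarms(probs, labels, 0.35)
```

On this toy data the 0.5 threshold catches two of the four frauds with no false alarms, while 0.35 catches all four at the cost of one false alarm, which is the trade the design favors when a missed scam is the more expensive error.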