A production-style research project combining ML-based return forecasting, CVaR-aware portfolio optimization, pair-trading, realistic transaction costs, and a semantic NLP event pipeline for S&P 500 equities.
```text
        Text Sources                          Market / Macro Data
 (Reddit, RSS, Twitter, mock)              (yfinance / demo fallback)
           │                                          │
           ▼                                          ▼
       NLP Pipeline                          Feature Engineering
  ┌──────────────┐                        ┌────────────────────┐
  │ entity link  │                        │ technical features │
  │ relevance    │───────────────────────▶│ cross-sectional    │
  │ sentiment    │                        │ macro features     │
  │ event detect │                        │ NLP features       │
  └──────────────┘                        └────────┬───────────┘
                                                   │
                                                   ▼
                                            ML Model Layer
                          (XGBoost / RF / LSTM / Transformer / Ensemble)
                                                   │
                        ┌──────────────────────────┴───────────────────┐
                        ▼                                              ▼
               Directional Sleeve                             Pair-Trading Sleeve
            CVaR Optimizer (cvxpy)                          Cointegration / Z-score
                        │                                              │
                        └──────────────────────┬───────────────────────┘
                                               ▼
                                        Sleeve Allocator
                                    (regime-aware weights)
                                               │
                                               ▼
                                        Backtest Engine
                                 (walk-forward, next-bar fill)
                                               │
                        ┌──────────────────────┼──────────────────────┐
                        ▼                      ▼                      ▼
                     Broker                 Metrics              Attribution
                ($10/trade fee)          (Sharpe, CVaR)          (sleeve P&L)
                                               │
                                               ▼
                                      Reports / Artifacts
                                   (HTML, CSV, PNG, Jupyter)
```
```bash
git clone <repo_url>
cd project
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
cp .env.example .env
# Edit .env with your keys
```

The system runs fully without any API keys using synthetic demo data.
```bash
# 1. Generate synthetic prices + NLP signals
python -m src.cli generate-demo
# 2. Build feature store from demo data
python -m src.cli build-dataset --use-demo
# 3. Build NLP features (uses demo NLP data)
python -m src.cli build-nlp-features --use-demo
# 4. Train model
python -m src.cli train-model
# 5. Select cointegrated pairs
python -m src.cli select-pairs
# 6. Run backtest
python -m src.cli run-backtest --output reports/backtest
# 7. Run ablations
python -m src.cli run-ablation --output reports/ablations
# 8. Generate final report
python -m src.cli generate-report --output reports/final
```

To run on live data instead of the demo fixtures:

```bash
python -m src.cli build-dataset        # downloads via yfinance (no key needed)
python -m src.cli build-nlp-features   # requires REDDIT_CLIENT_ID etc. in .env
```

Transaction costs are modelled in src/portfolio/costs.py with three components:
| Component | Default | Description |
|---|---|---|
| Fixed fee | $10.00 per order | Charged for every buy or sell on a single ticker |
| Bid-ask half-spread | 5 bps | Half the bid-ask spread, applied on fill |
| Slippage / market impact | 3 bps | Estimated execution slippage |
Counting rules:
- Selling AAPL + buying MSFT = 2 orders = $20 fixed fees
- Pair trades (long A + short B) = 2 orders = $20
- Fees are tracked separately and reported as "fee drag" in metrics
All cost parameters are configurable in configs/backtest.yaml.
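The counting rules above can be sketched as a per-order cost function. This is an illustrative sketch, not the actual src/portfolio/costs.py API; the function name and signature are assumptions, only the defaults ($10 fixed, 5 bps half-spread, 3 bps slippage) come from the table.

```python
def order_cost(notional: float, fixed_fee: float = 10.0,
               half_spread_bps: float = 5.0, slippage_bps: float = 3.0) -> float:
    """Total cost of one order: fixed fee plus bps-proportional spread and slippage."""
    variable = abs(notional) * (half_spread_bps + slippage_bps) / 10_000
    return fixed_fee + variable

# A pair trade (long A + short B) is two orders, so fixed fees are charged twice:
# $50,000 per leg -> 2 * (10 + 50_000 * 8 / 10_000) = 2 * 50 = $100 total
pair_cost = order_cost(50_000) + order_cost(-50_000)
```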
The semantic event pipeline (src/nlp/) ingests public text and converts it into
ticker-level market features.
| Source | Connector | Requires |
|---|---|---|
| Reddit | PRAW | REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET |
| Twitter/X | Bearer token | TWITTER_BEARER_TOKEN |
| RSS/News | feedparser | None (public feeds) |
| Mock/Demo | JSON file | None |
- Entity linking — maps text to tickers, sectors, themes using a rule dictionary (e.g. "Trump" → tariffs/EV/China exposure; "Musk" → TSLA/AI/robotaxi) with embedding-similarity fallback for unlisted entities
- Relevance scoring — cosine similarity of post embedding to ticker embedding; optional zero-shot classifier
- Sentiment scoring — FinBERT (ProsusAI/finbert) returns positive/negative/neutral with confidence; influencer accounts receive boosted weight
- Event detection — categorises posts into earnings, product_launch, regulation, tariff_trade, macro_policy, ceo_statement, political_statement, etc.
- Feature generation — produces daily, ticker-aligned features:
- rolling sentiment mean (1d, 3d, 7d)
- mention volume spike (z-score)
- influencer-weighted sentiment
- recency-decayed relevance score
- event novelty score
All features are timestamped at market close and shifted by 1 bar to prevent lookahead bias.
```text
"Trump announces 25% tariffs on Chinese EVs"
  → tickers: [TSLA, XPEV, NIO], sectors: [Consumer Discretionary, Technology]
  → themes: [tariffs, trade_policy, china], event: tariff_trade

"Elon Musk: FSD v12 rolling out to all North America this quarter"
  → tickers: [TSLA], themes: [fsd, self_driving, product_launch]
  → sentiment: positive, confidence: 0.87, event: product_launch
```
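The rule-dictionary lookup behind these examples can be sketched as follows. The dictionary contents, function name, and output shape are illustrative assumptions, not the actual src/nlp/entity_linking.py code (which also has an embedding-similarity fallback not shown here):

```python
# Hypothetical rule dictionary mapping surface entities to market exposure.
ENTITY_RULES = {
    "trump": {"tickers": ["TSLA", "XPEV", "NIO"],
              "themes": ["tariffs", "trade_policy", "china"]},
    "musk": {"tickers": ["TSLA"],
             "themes": ["fsd", "self_driving", "robotaxi"]},
}

def link_entities(text: str) -> dict:
    """Union the tickers/themes of every rule entity mentioned in the text."""
    hit = {"tickers": set(), "themes": set()}
    lowered = text.lower()
    for entity, exposure in ENTITY_RULES.items():
        if entity in lowered:
            hit["tickers"].update(exposure["tickers"])
            hit["themes"].update(exposure["themes"])
    return hit

linked = link_entities("Elon Musk: FSD v12 rolling out this quarter")
```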
All behaviour is controlled by YAML files in configs/:
| File | Controls |
|---|---|
| base.yaml | Universe, date range, rebalance frequency, sleeve weights |
| data.yaml | Data provider, feature engineering parameters |
| model.yaml | Active model, hyperparameters, walk-forward splits |
| backtest.yaml | Execution assumptions, fees, optimizer params, ablation list |
| nlp.yaml | Source connectors, entity dict, sentiment model, feature params |
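For orientation, a backtest.yaml fragment might look like the sketch below; the key names are illustrative assumptions rather than the project's actual schema, only the values come from the cost and execution descriptions in this README:

```yaml
execution:
  fill: next_bar          # orders fill at the next bar's price
  rebalance: weekly
costs:
  fixed_fee_usd: 10.0     # per order, per ticker
  half_spread_bps: 5
  slippage_bps: 3
optimizer:
  objective: cvar
  cvar_alpha: 0.95
```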
The backtest engine computes:
- CAGR, annualised volatility
- Sharpe ratio, Sortino ratio, Calmar ratio
- Maximum drawdown
- Hit rate (fraction of correct direction calls)
- Turnover (annualised)
- Beta to SPY
- CVaR at 95%
- Gross return vs net return
- Transaction fee drag (fixed + spread + slippage)
- Sleeve attribution (directional vs pair)
- Factor / sector exposure summary
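Among the metrics above, CVaR at 95% is the mean loss over the worst 5% of return observations. A minimal sketch of that definition (not the backtest engine's actual implementation):

```python
def cvar(returns: list[float], alpha: float = 0.95) -> float:
    """Expected loss in the worst (1 - alpha) tail, as a positive number."""
    losses = sorted((-r for r in returns), reverse=True)  # worst losses first
    k = max(1, int(len(losses) * (1 - alpha)))            # tail sample count
    return sum(losses[:k]) / k

# With 20 daily returns, the 95% tail is the single worst day:
tail_loss = cvar([0.01] * 19 + [-0.10])  # -> 0.10
```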
run-ablation executes six scenarios pre-defined in configs/backtest.yaml:
- no_nlp — disable NLP features
- no_pairs — disable pair-trading sleeve
- no_cvar — replace CVaR optimizer with mean-variance
- fee_only — fixed fee only, no spread/slippage
- daily_rebalance — daily vs weekly frequency
- tree_only — XGBoost-only (no LSTM/ensemble)
- Prices are corporate-action adjusted (splits + dividends via yfinance)
- No survivorship bias guard in demo mode; the production universe uses point-in-time S&P 500 membership
- All features are computed on close prices and shifted by 1 trading day before use as model inputs
- NLP features are stamped at 16:00 ET and aligned to the next market open
- Survivorship bias — the demo universe uses current S&P 500 tickers; production should use historical membership snapshots
- Twitter/X connector — Academic API access is required; stub returns empty list without key
- Truth Social — No official API; web ingestion is marked as TODO pending legal review
- FinBERT inference — GPU strongly recommended for large post volumes; CPU is supported but slow
- CVXPY solver — Falls back to SCS if MOSEK/Clarabel is unavailable; may be slower for large universes
- Options data — True implied volatility is not available via free APIs; realized vol is used as proxy
```text
project/
  README.md
  requirements.txt
  pyproject.toml
  .env.example
  configs/       base.yaml | data.yaml | model.yaml | backtest.yaml | nlp.yaml
  data/
    raw/         downloaded raw files
    processed/   features.parquet, nlp_features.parquet, pairs.json
    cache/       yfinance cache, demo_prices.parquet, demo_nlp.json
  src/
    cli.py       Click CLI entry-point
    main.py      Programmatic pipeline orchestrator
    utils/       config, logging, schemas, demo_data
    data/        universe, market_data, macro_data, feature_store
    features/    technical, cross_sectional, macro_features, nlp_features
    models/      base_model, xgb, rf, lstm, transformer, regime, ensemble
    statarb/     pair_selection, cointegration, spread_signals, pair_portfolio
    nlp/         source_connectors, entity_linking, relevance, sentiment, events
    portfolio/   optimizer, constraints, sleeve_allocator, risk_model, costs
    backtest/    engine, broker, execution, metrics, walk_forward, attribution
    baselines/   equal_weight, buy_and_hold, logistic, arima, risk_parity
    reports/     tables, charts, export
  tests/
  notebooks/
    00_demo.ipynb  End-to-end walkthrough on synthetic data
```