AI-Driven Portfolio Optimization & Backtesting System

A production-style research project combining ML-based return forecasting, CVaR-aware portfolio optimization, pair-trading, realistic transaction costs, and a semantic NLP event pipeline for S&P 500 equities.


Architecture

Text Sources                  Market / Macro Data
(Reddit, RSS, Twitter, mock)  (yfinance / demo fallback)
        │                              │
        ▼                              ▼
  NLP Pipeline              Feature Engineering
  ┌──────────────┐          ┌────────────────────┐
  │ entity link  │          │ technical features │
  │ relevance    │────────▶ │ cross-sectional    │
  │ sentiment    │          │ macro features     │
  │ event detect │          │ NLP features       │
  └──────────────┘          └────────┬───────────┘
                                     │
                                     ▼
                             ML Model Layer
                    (XGBoost / RF / LSTM / Transformer / Ensemble)
                                     │
                    ┌────────────────┴─────────────────┐
                    ▼                                   ▼
           Directional Sleeve                 Pair-Trading Sleeve
           CVaR Optimizer (cvxpy)             Cointegration / Z-score
                    │                                   │
                    └──────────────┬────────────────────┘
                                   ▼
                          Sleeve Allocator
                         (regime-aware weights)
                                   │
                                   ▼
                         Backtest Engine
                    (walk-forward, next-bar fill)
                                   │
                    ┌──────────────┼──────────────┐
                    ▼              ▼              ▼
              Broker         Metrics         Attribution
           ($10/trade fee)   (Sharpe, CVaR)  (sleeve P&L)
                                   │
                                   ▼
                            Reports / Artifacts
                         (HTML, CSV, PNG, Jupyter)

Setup

1. Clone and install

git clone <repo_url>
cd project
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

pip install -r requirements.txt
pip install -e .

2. Configure API keys (optional)

cp .env.example .env
# Edit .env with your keys

The system runs end-to-end without any API keys by falling back to synthetic demo data.

3. Generate demo data (no API keys required)

python -m src.cli generate-demo

Running the Pipeline

Quick end-to-end demo

# 1. Generate synthetic prices + NLP signals
python -m src.cli generate-demo

# 2. Build feature store from demo data
python -m src.cli build-dataset --use-demo

# 3. Build NLP features (uses demo NLP data)
python -m src.cli build-nlp-features --use-demo

# 4. Train model
python -m src.cli train-model

# 5. Select cointegrated pairs
python -m src.cli select-pairs

# 6. Run backtest
python -m src.cli run-backtest --output reports/backtest

# 7. Run ablations
python -m src.cli run-ablation --output reports/ablations

# 8. Generate final report
python -m src.cli generate-report --output reports/final
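Step 5 above selects cointegrated pairs for the pair-trading sleeve. The z-score spread signal driving that sleeve can be sketched as follows (a minimal illustration, not the project's actual API; function and parameter names are assumptions):

```python
import numpy as np

def zscore_signal(spread, lookback=60, entry=2.0, exit_=0.5):
    """Rolling z-score of a pair spread with entry/exit thresholds.

    spread: 1-D array of price_A - hedge_ratio * price_B.
    Returns an array of positions: +1 long the spread, -1 short, 0 flat.
    """
    spread = np.asarray(spread, dtype=float)
    pos = np.zeros(len(spread))
    for t in range(lookback, len(spread)):
        window = spread[t - lookback:t]          # past bars only (no lookahead)
        z = (spread[t] - window.mean()) / window.std()
        if z > entry:          # spread rich -> short it
            pos[t] = -1.0
        elif z < -entry:       # spread cheap -> long it
            pos[t] = 1.0
        elif abs(z) < exit_:   # mean reverted -> go flat
            pos[t] = 0.0
        else:                  # in the hold zone -> keep previous position
            pos[t] = pos[t - 1]
    return pos
```

Using only the trailing window for the mean and standard deviation keeps the signal consistent with the engine's next-bar-fill, no-lookahead conventions.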

Using real market data

python -m src.cli build-dataset   # downloads via yfinance (no key needed)
python -m src.cli build-nlp-features  # requires REDDIT_CLIENT_ID etc. in .env

Transaction Fees

Transaction costs are modelled in src/portfolio/costs.py with three components:

Component                   Default            Description
Fixed fee                   $10.00 per order   Charged for every buy or sell on a single ticker
Bid-ask half-spread         5 bps              Half the bid-ask spread, applied on fill
Slippage / market impact    3 bps              Estimated execution slippage

Counting rules:

  • Selling AAPL + buying MSFT = 2 orders = $20 fixed fees
  • Pair trades (long A + short B) = 2 orders = $20
  • Fees are tracked separately and reported as "fee drag" in metrics

All cost parameters are configurable in configs/backtest.yaml.
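As a rough sketch of how the three components combine per order (a simplified illustration with assumed names; the actual implementation lives in src/portfolio/costs.py):

```python
def order_cost(notional, fixed_fee=10.0, half_spread_bps=5.0, slippage_bps=3.0):
    """Total cost of one order: fixed fee plus spread and slippage on notional.

    notional: signed dollar value traded on a single ticker.
    """
    variable = abs(notional) * (half_spread_bps + slippage_bps) / 10_000
    return fixed_fee + variable

def rebalance_cost(trades, **kwargs):
    """Sum of per-order costs; each ticker bought or sold counts as one order."""
    return sum(order_cost(n, **kwargs) for n in trades.values())
```

Under the defaults, selling $50,000 of AAPL and buying $50,000 of MSFT costs 2 x ($10 fixed + $50,000 x 8 bps) = $100 total.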


NLP / Event Module

The semantic event pipeline (src/nlp/) ingests public text and converts it into ticker-level market features.

Sources (all optional, graceful degradation if unavailable)

Source      Connector      Requires
Reddit      PRAW           REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET
Twitter/X   Bearer token   TWITTER_BEARER_TOKEN
RSS/News    feedparser     None (public feeds)
Mock/Demo   JSON file      None

Pipeline stages

  1. Entity linking — maps text to tickers, sectors, themes using a rule dictionary (e.g. "Trump" → tariffs/EV/China exposure; "Musk" → TSLA/AI/robotaxi) with embedding-similarity fallback for unlisted entities
  2. Relevance scoring — cosine similarity of post embedding to ticker embedding; optional zero-shot classifier
  3. Sentiment scoring — FinBERT (ProsusAI/finbert) returns positive/negative/neutral with confidence; influencer accounts receive boosted weight
  4. Event detection — categorises posts into earnings, product_launch, regulation, tariff_trade, macro_policy, ceo_statement, political_statement, etc.
  5. Feature generation — produces daily, ticker-aligned features:
    • rolling sentiment mean (1d, 3d, 7d)
    • mention volume spike (z-score)
    • influencer-weighted sentiment
    • recency-decayed relevance score
    • event novelty score

All features are timestamped at market close and shifted by 1 bar to prevent lookahead bias.
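The rolling-mean, spike, and shift logic from stage 5 can be sketched with pandas (column names and window parameters here are illustrative, not the project's actual schema):

```python
import pandas as pd

def sentiment_features(df):
    """Build lookahead-safe daily sentiment features for one ticker.

    df: daily frame indexed by date with 'sentiment' and 'mentions' columns,
    already aggregated to one row per day.
    """
    feats = pd.DataFrame(index=df.index)
    # rolling sentiment means over 1, 3, and 7 trading days
    for window in (1, 3, 7):
        feats[f"sent_mean_{window}d"] = df["sentiment"].rolling(window).mean()
    # mention-volume spike as a z-score against a 20-day baseline
    vol = df["mentions"]
    feats["mention_spike"] = (vol - vol.rolling(20).mean()) / vol.rolling(20).std()
    # shift everything by one bar so features at date t use only info through t-1
    return feats.shift(1)
```

The final `shift(1)` is the lookahead guard: a feature stamped at date t is only ever computed from posts observed up to the previous close.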

Example mappings

"Trump announces 25% tariffs on Chinese EVs"
  → tickers: [TSLA, XPEV, NIO], sectors: [Consumer Discretionary, Technology]
  → themes: [tariffs, trade_policy, china], event: tariff_trade

"Elon Musk: FSD v12 rolling out to all North America this quarter"
  → tickers: [TSLA], themes: [fsd, self_driving, product_launch]
  → sentiment: positive, confidence: 0.87, event: product_launch
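The rule-dictionary lookup behind these mappings can be sketched as a keyword scan (the dictionary below is a tiny illustrative subset, not the project's full entity map):

```python
# illustrative rule dictionary: keyword -> (tickers, themes)
ENTITY_RULES = {
    "musk":  (["TSLA"], ["ai", "robotaxi"]),
    "trump": (["TSLA", "XPEV", "NIO"], ["tariffs", "trade_policy", "china"]),
    "fsd":   (["TSLA"], ["fsd", "self_driving"]),
}

def link_entities(text):
    """Return the union of tickers and themes whose keywords appear in text."""
    tickers, themes = set(), set()
    lowered = text.lower()
    for keyword, (tks, ths) in ENTITY_RULES.items():
        if keyword in lowered:
            tickers.update(tks)
            themes.update(ths)
    return sorted(tickers), sorted(themes)
```

Entities missing from the dictionary fall through to the embedding-similarity fallback described in stage 1.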

Configuration

All behaviour is controlled by YAML files in configs/:

File            Controls
base.yaml       Universe, date range, rebalance frequency, sleeve weights
data.yaml       Data provider, feature engineering parameters
model.yaml      Active model, hyperparameters, walk-forward splits
backtest.yaml   Execution assumptions, fees, optimizer params, ablation list
nlp.yaml        Source connectors, entity dict, sentiment model, feature params
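For orientation, a hypothetical fragment of backtest.yaml might look like the following (every key name here is an assumption; check the file itself for the real schema):

```yaml
# hypothetical fragment -- key names are illustrative, not the real schema
fees:
  fixed_per_order: 10.0    # USD per buy/sell on a single ticker
  half_spread_bps: 5
  slippage_bps: 3
rebalance:
  frequency: weekly
ablations:
  - no_nlp
  - no_pairs
```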

Evaluation Metrics

The backtest engine computes:

  • CAGR, annualised volatility
  • Sharpe ratio, Sortino ratio, Calmar ratio
  • Maximum drawdown
  • Hit rate (fraction of correct direction calls)
  • Turnover (annualised)
  • Beta to SPY
  • CVaR at 95%
  • Gross return vs net return
  • Transaction fee drag (fixed + spread + slippage)
  • Sleeve attribution (directional vs pair)
  • Factor / sector exposure summary
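A few of these metrics, sketched with numpy (minimal reference formulas; the engine's own implementations in src/backtest/metrics.py may differ in detail, e.g. in risk-free handling):

```python
import numpy as np

def sharpe(returns, periods=252):
    """Annualised Sharpe ratio of per-bar simple returns (risk-free rate = 0)."""
    r = np.asarray(returns, dtype=float)
    return np.sqrt(periods) * r.mean() / r.std()

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    equity = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    peaks = np.maximum.accumulate(equity)
    return (equity / peaks - 1.0).min()

def cvar(returns, alpha=0.95):
    """Expected loss in the worst (1 - alpha) tail of the return distribution."""
    r = np.sort(np.asarray(returns, dtype=float))
    tail = r[: max(1, int(np.ceil((1 - alpha) * len(r))))]
    return -tail.mean()
```

CVaR at 95% is reported as a positive number: the average loss over the worst 5% of return observations.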

Ablations

run-ablation executes six scenarios defined in configs/backtest.yaml:

  1. no_nlp — disable NLP features
  2. no_pairs — disable pair-trading sleeve
  3. no_cvar — replace CVaR optimizer with mean-variance
  4. fee_only — fixed fee only, no spread/slippage
  5. daily_rebalance — daily vs weekly frequency
  6. tree_only — XGBoost-only (no LSTM/ensemble)

Data Assumptions

  • Prices are corporate-action adjusted (splits + dividends via yfinance)
  • No survivorship bias guard in demo mode; production universe uses point-in-time S&P 500 membership
  • All features are computed on close prices and shifted by 1 trading day before use as model inputs
  • NLP features are stamped at 16:00 ET and aligned to the next market open

Known Limitations

  1. Survivorship bias — demo universe uses current S&P 500 tickers; production should use historical membership snapshots
  2. Twitter/X connector — Academic API access is required; stub returns empty list without key
  3. Truth Social — No official API; web ingestion is marked as TODO pending legal review
  4. FinBERT inference — GPU strongly recommended for large post volumes; CPU is supported but slow
  5. CVXPY solver — Falls back to SCS if MOSEK/Clarabel is unavailable; may be slower for large universes
  6. Options data — True implied volatility is not available via free APIs; realized vol is used as proxy

Project Structure

project/
  README.md
  requirements.txt
  pyproject.toml
  .env.example
  configs/           base.yaml | data.yaml | model.yaml | backtest.yaml | nlp.yaml
  data/
    raw/             downloaded raw files
    processed/       features.parquet, nlp_features.parquet, pairs.json
    cache/           yfinance cache, demo_prices.parquet, demo_nlp.json
  src/
    cli.py           Click CLI entry-point
    main.py          Programmatic pipeline orchestrator
    utils/           config, logging, schemas, demo_data
    data/            universe, market_data, macro_data, feature_store
    features/        technical, cross_sectional, macro_features, nlp_features
    models/          base_model, xgb, rf, lstm, transformer, regime, ensemble
    statarb/         pair_selection, cointegration, spread_signals, pair_portfolio
    nlp/             source_connectors, entity_linking, relevance, sentiment, events
    portfolio/       optimizer, constraints, sleeve_allocator, risk_model, costs
    backtest/        engine, broker, execution, metrics, walk_forward, attribution
    baselines/       equal_weight, buy_and_hold, logistic, arima, risk_parity
    reports/         tables, charts, export
  tests/
  notebooks/
    00_demo.ipynb    End-to-end walkthrough on synthetic data
