A production-style research project combining ML-based return forecasting, CVaR-aware portfolio optimization, pair-trading, realistic transaction costs, and a semantic NLP event pipeline for S&P 500 equities.
```text
        Text Sources                          Market / Macro Data
 (Reddit, RSS, Twitter, mock)              (yfinance / demo fallback)
           │                                          │
           ▼                                          ▼
       NLP Pipeline                          Feature Engineering
  ┌──────────────┐                        ┌────────────────────┐
  │ entity link  │                        │ technical features │
  │ relevance    │───────────────────────▶│ cross-sectional    │
  │ sentiment    │                        │ macro features     │
  │ event detect │                        │ NLP features       │
  └──────────────┘                        └────────┬───────────┘
                                                   │
                                                   ▼
                                            ML Model Layer
                          (XGBoost / RF / LSTM / Transformer / Ensemble)
                                                   │
                        ┌──────────────────────────┴───────────────────┐
                        ▼                                              ▼
               Directional Sleeve                             Pair-Trading Sleeve
            CVaR Optimizer (cvxpy)                          Cointegration / Z-score
                        │                                              │
                        └──────────────────────┬───────────────────────┘
                                               ▼
                                        Sleeve Allocator
                                    (regime-aware weights)
                                               │
                                               ▼
                                        Backtest Engine
                                 (walk-forward, next-bar fill)
                                               │
                        ┌──────────────────────┼──────────────────────┐
                        ▼                      ▼                      ▼
                     Broker                 Metrics              Attribution
                ($10/trade fee)          (Sharpe, CVaR)          (sleeve P&L)
                                               │
                                               ▼
                                      Reports / Artifacts
                                   (HTML, CSV, PNG, Jupyter)
```
```bash
git clone <repo_url>
cd project
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
cp .env.example .env
# Edit .env with your keys
```

The system runs fully without any API keys using synthetic demo data.
```bash
# 1. Generate synthetic prices + NLP signals
python -m src.cli generate-demo
# 2. Build feature store from demo data
python -m src.cli build-dataset --use-demo
# 3. Build NLP features (uses demo NLP data)
python -m src.cli build-nlp-features --use-demo
# 4. Train model
python -m src.cli train-model
# 5. Select cointegrated pairs
python -m src.cli select-pairs
# 6. Run backtest
python -m src.cli run-backtest --output reports/backtest
# 7. Run ablations
python -m src.cli run-ablation --output reports/ablations
# 8. Generate final report
python -m src.cli generate-report --output reports/final
```

To run on live data instead of the demo fixtures:

```bash
python -m src.cli build-dataset        # downloads via yfinance (no key needed)
python -m src.cli build-nlp-features   # requires REDDIT_CLIENT_ID etc. in .env
```

Transaction costs are modelled in src/portfolio/costs.py with three components:
| Component | Default | Description |
|---|---|---|
| Fixed fee | $10.00 per order | Charged for every buy or sell on a single ticker |
| Bid-ask half-spread | 5 bps | Half the bid-ask spread, applied on fill |
| Slippage / market impact | 3 bps | Estimated execution slippage |
Counting rules:
- Selling AAPL + buying MSFT = 2 orders = $20 fixed fees
- Pair trades (long A + short B) = 2 orders = $20
- Fees are tracked separately and reported as "fee drag" in metrics
All cost parameters are configurable in configs/backtest.yaml.
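The counting rules above can be sketched as a per-order cost function. This is an illustrative sketch, not the actual src/portfolio/costs.py API; the function name and signature are assumptions, only the defaults ($10 fixed, 5 bps half-spread, 3 bps slippage) come from the table.

```python
def order_cost(notional: float, fixed_fee: float = 10.0,
               half_spread_bps: float = 5.0, slippage_bps: float = 3.0) -> float:
    """Total cost of one order: fixed fee plus bps-proportional spread and slippage."""
    variable = abs(notional) * (half_spread_bps + slippage_bps) / 10_000
    return fixed_fee + variable

# A pair trade (long A + short B) is two orders, so fixed fees are charged twice:
# $50,000 per leg -> 2 * (10 + 50_000 * 8 / 10_000) = 2 * 50 = $100 total
pair_cost = order_cost(50_000) + order_cost(-50_000)
```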
The semantic event pipeline (src/nlp/) ingests public text and converts it into
ticker-level market features.
| Source | Connector | Requires |
|---|---|---|
| Reddit | PRAW | REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET |
| Twitter/X | Bearer token | TWITTER_BEARER_TOKEN |
| RSS/News | feedparser | None (public feeds) |
| Mock/Demo | JSON file | None |
- Entity linking — maps text to tickers, sectors, themes using a rule dictionary (e.g. "Trump" → tariffs/EV/China exposure; "Musk" → TSLA/AI/robotaxi) with embedding-similarity fallback for unlisted entities
- Relevance scoring — cosine similarity of post embedding to ticker embedding; optional zero-shot classifier
- Sentiment scoring — FinBERT (ProsusAI/finbert) returns positive/negative/neutral with confidence; influencer accounts receive boosted weight
- Event detection — categorises posts into earnings, product_launch, regulation, tariff_trade, macro_policy, ceo_statement, political_statement, etc.
- Feature generation — produces daily, ticker-aligned features:
- rolling sentiment mean (1d, 3d, 7d)
- mention volume spike (z-score)
- influencer-weighted sentiment
- recency-decayed relevance score
- event novelty score
All features are timestamped at market close and shifted by 1 bar to prevent lookahead bias.
```text
"Trump announces 25% tariffs on Chinese EVs"
  → tickers: [TSLA, XPEV, NIO], sectors: [Consumer Discretionary, Technology]
  → themes: [tariffs, trade_policy, china], event: tariff_trade

"Elon Musk: FSD v12 rolling out to all North America this quarter"
  → tickers: [TSLA], themes: [fsd, self_driving, product_launch]
  → sentiment: positive, confidence: 0.87, event: product_launch
```
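The rule-dictionary lookup behind these examples can be sketched as follows. The dictionary contents, function name, and output shape are illustrative assumptions, not the actual src/nlp/entity_linking.py code (which also has an embedding-similarity fallback not shown here):

```python
# Hypothetical rule dictionary mapping surface entities to market exposure.
ENTITY_RULES = {
    "trump": {"tickers": ["TSLA", "XPEV", "NIO"],
              "themes": ["tariffs", "trade_policy", "china"]},
    "musk": {"tickers": ["TSLA"],
             "themes": ["fsd", "self_driving", "robotaxi"]},
}

def link_entities(text: str) -> dict:
    """Union the tickers/themes of every rule entity mentioned in the text."""
    hit = {"tickers": set(), "themes": set()}
    lowered = text.lower()
    for entity, exposure in ENTITY_RULES.items():
        if entity in lowered:
            hit["tickers"].update(exposure["tickers"])
            hit["themes"].update(exposure["themes"])
    return hit

linked = link_entities("Elon Musk: FSD v12 rolling out this quarter")
```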
All behaviour is controlled by YAML files in configs/:
| File | Controls |
|---|---|
| base.yaml | Universe, date range, rebalance frequency, sleeve weights |
| data.yaml | Data provider, feature engineering parameters |
| model.yaml | Active model, hyperparameters, walk-forward splits |
| backtest.yaml | Execution assumptions, fees, optimizer params, ablation list |
| nlp.yaml | Source connectors, entity dict, sentiment model, feature params |
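For orientation, a backtest.yaml fragment might look like the sketch below; the key names are illustrative assumptions rather than the project's actual schema, only the values come from the cost and execution descriptions in this README:

```yaml
execution:
  fill: next_bar          # orders fill at the next bar's price
  rebalance: weekly
costs:
  fixed_fee_usd: 10.0     # per order, per ticker
  half_spread_bps: 5
  slippage_bps: 3
optimizer:
  objective: cvar
  cvar_alpha: 0.95
```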
The backtest engine computes:
- CAGR, annualised volatility
- Sharpe ratio, Sortino ratio, Calmar ratio
- Maximum drawdown
- Hit rate (fraction of correct direction calls)
- Turnover (annualised)
- Beta to SPY
- CVaR at 95%
- Gross return vs net return
- Transaction fee drag (fixed + spread + slippage)
- Sleeve attribution (directional vs pair)
- Factor / sector exposure summary
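Among the metrics above, CVaR at 95% is the mean loss over the worst 5% of return observations. A minimal sketch of that definition (not the backtest engine's actual implementation):

```python
def cvar(returns: list[float], alpha: float = 0.95) -> float:
    """Expected loss in the worst (1 - alpha) tail, as a positive number."""
    losses = sorted((-r for r in returns), reverse=True)  # worst losses first
    k = max(1, int(len(losses) * (1 - alpha)))            # tail sample count
    return sum(losses[:k]) / k

# With 20 daily returns, the 95% tail is the single worst day:
tail_loss = cvar([0.01] * 19 + [-0.10])  # -> 0.10
```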
run-ablation executes six scenarios pre-defined in configs/backtest.yaml:
- no_nlp — disable NLP features
- no_pairs — disable pair-trading sleeve
- no_cvar — replace CVaR optimizer with mean-variance
- fee_only — fixed fee only, no spread/slippage
- daily_rebalance — daily vs weekly frequency
- tree_only — XGBoost-only (no LSTM/ensemble)
- Prices are corporate-action adjusted (splits + dividends via yfinance)
- No survivorship bias guard in demo mode; the production universe uses point-in-time S&P 500 membership
- All features are computed on close prices and shifted by 1 trading day before use as model inputs
- NLP features are stamped at 16:00 ET and aligned to the next market open
- Survivorship bias — the demo universe uses current S&P 500 tickers; production should use historical membership snapshots
- Twitter/X connector — Academic API access is required; stub returns empty list without key
- Truth Social — No official API; web ingestion is marked as TODO pending legal review
- FinBERT inference — GPU strongly recommended for large post volumes; CPU is supported but slow
- CVXPY solver — Falls back to SCS if MOSEK/Clarabel is unavailable; may be slower for large universes
- Options data — True implied volatility is not available via free APIs; realized vol is used as proxy
```text
project/
  README.md
  requirements.txt
  pyproject.toml
  .env.example
  configs/       base.yaml | data.yaml | model.yaml | backtest.yaml | nlp.yaml
  data/
    raw/         downloaded raw files
    processed/   features.parquet, nlp_features.parquet, pairs.json
    cache/       yfinance cache, demo_prices.parquet, demo_nlp.json
  src/
    cli.py       Click CLI entry-point
    main.py      Programmatic pipeline orchestrator
    utils/       config, logging, schemas, demo_data
    data/        universe, market_data, macro_data, feature_store
    features/    technical, cross_sectional, macro_features, nlp_features
    models/      base_model, xgb, rf, lstm, transformer, regime, ensemble
    statarb/     pair_selection, cointegration, spread_signals, pair_portfolio
    nlp/         source_connectors, entity_linking, relevance, sentiment, events
    portfolio/   optimizer, constraints, sleeve_allocator, risk_model, costs
    backtest/    engine, broker, execution, metrics, walk_forward, attribution
    baselines/   equal_weight, buy_and_hold, logistic, arima, risk_parity
    reports/     tables, charts, export
  tests/
  notebooks/
    00_demo.ipynb  End-to-end walkthrough on synthetic data
```