34 changes: 34 additions & 0 deletions .env.example
@@ -6,6 +6,40 @@ GID=1000
# Anthropic
ANTHROPIC_API_KEY=sk-ant-...

# ── Auth (JWT) ────────────────────────────────────────────────────────────────
# Leave ALL of these blank to run in dev mode (no auth, anonymous superuser).
#
# Option A — HS256 shared secret (simple / internal):
# JWT_SECRET=supersecretkey
#
# Option B — RS256 via JWKS (Cognito, Auth0, Okta — recommended for production):
# JWT_JWKS_URL=https://cognito-idp.eu-west-1.amazonaws.com/<pool_id>/.well-known/jwks.json
# JWT_ISSUER=https://cognito-idp.eu-west-1.amazonaws.com/<pool_id>
# JWT_AUDIENCE=<app_client_id> # optional but recommended
#
# Option B2 — Microsoft Entra ID (Azure AD):
# JWT_JWKS_URL=https://login.microsoftonline.com/<tenant_id>/discovery/v2.0/keys
# JWT_ISSUER=https://login.microsoftonline.com/<tenant_id>/v2.0
# JWT_AUDIENCE=<azure_client_id>
#
JWT_SECRET=
JWT_JWKS_URL=
JWT_ISSUER=
JWT_AUDIENCE=

# ── Frontend Azure AD SSO (Vite build-time) ───────────────────────────────────
# Leave blank to run without SSO (dev mode — no login screen).
# Must match the app registration in Azure Entra ID.
VITE_AZURE_CLIENT_ID=
VITE_AZURE_TENANT_ID=

# ── Permissions (DynamoDB) ────────────────────────────────────────────────────
# Leave blank to run in dev mode (all users get superuser permissions).
# Table schema: PK=user_id (S), email (S), role_arn (S),
# allowed_datasets (SS), allowed_namespaces (SS), is_admin (BOOL)
#
DYNAMODB_PERMISSIONS_TABLE=

# AWS (can be static IAM credentials or temporary STS credentials)
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ data/
openlineage/data/
openlineage/events.ndjson
chroma_data/
knowledge/

# Maven / Ivy cache (JARs downloaded by spark.jars.packages)
.ivy2/
21 changes: 21 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,21 @@
# Changelog

## Unreleased

### 2026-05-06
- **docs: add contributing guide and MIT license** — Added CONTRIBUTING.md with setup instructions, project layout, development workflows, and commit conventions; added MIT LICENSE.md to establish open-source governance.

### 2026-05-05
- **feat(auth): add JWT authentication and role-based permissions** — Added JWT authentication (HS256 and RS256/JWKS) with support for dev mode, role-based access control via DynamoDB, and Azure AD SSO for frontend. Includes new /api/me endpoint and permission checks on API routes.

### 2026-05-05
- **refactor(hooks): defer changelog updates to post-commit hook** — Moved changelog file writing to post-commit hook to ensure clean staging area during prepare-commit-msg phase, fixing issues with changelog generation workflow.

### 2026-05-05
- **style: add Rootly logo to header and empty state** — Added Rootly logo image to application header and empty state UI for improved branding and visual identity.

### 2026-05-05
- **refactor: decompose GraphView and enhance RAG with semantic search** — Refactored GraphView into modular components (controls, context menu, layout), added semantic search with Athena integration, improved backend caching of filter terms and datasets, increased proxy timeout to 300s, and enhanced UI with markdown tables, code copy buttons, and loading suggestions.

### 2026-05-05
- **feat(rag): add Claude commit hooks and improve knowledge chunking** — Added git hooks that auto-generate conventional commit messages and changelog entries using Claude Haiku; improved knowledge document splitting to handle large tables and long sections intelligently.
91 changes: 91 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,91 @@
# Contributing to Rootly

## Prerequisites

- Docker + Docker Compose
- Python 3.11+
- Node 20+ (frontend)
- An `ANTHROPIC_API_KEY`

## Local setup

```bash
cp .env.example .env # fill in ANTHROPIC_API_KEY (and S3 vars if needed)
docker compose up --build
```

Without S3 access, the system runs against the sample events in `openlineage/` and `examples/`.

## Project layout

| Path | Responsibility |
|---|---|
| `rag/` | Ingestion, vectorization, RAG pipeline, tools |
| `backend/` | FastAPI app, Celery tasks |
| `frontend/src/` | React + TypeScript UI |
| `knowledge/` | Business docs indexed into ChromaDB |
| `conf/` | Domain/agent configuration YAML |

## Making changes

### Backend / RAG

1. Edit code under `rag/` or `backend/`.
2. Restart the backend container: `docker compose restart backend`.
3. Re-index if you changed ingestion or vectorization: `POST /api/sync` or `python -m rag.query sync`.

### Frontend

```bash
cd frontend
npm install
npm run dev # dev server at http://localhost:5173
npm run build # production build
```

### Adding a new RAG tool

1. Create `rag/tools/<tool_name>.py` and implement the handler.
2. Register it in `rag/tools/__init__.py` (add it to the `TOOLS` list and to the `execute_tool_call` dispatcher).
3. Add a row to the tools table in `CLAUDE.md`.

## Testing

```bash
# Quick smoke test against local events
python -m rag.query ask "Which datasets exist?"

# Impact analysis
python -m rag.query impact <dataset_name>
```

There is no automated test suite yet. Manual verification against `examples/` data is the current approach.

## Commit style

Follow [Conventional Commits](https://www.conventionalcommits.org/):

```
feat(rag): add reranking step to pipeline
fix(backend): avoid reload race on task completion
refactor(tools): extract S3 fetch helper
```

Scope is optional but encouraged (`rag`, `backend`, `frontend`, `tools`, `ingest`).

## Pull requests

- Branch from `main`, target `main`.
- One logical change per PR.
- Include a short description of *why*, not just what.
- If you change the RAG pipeline, note whether ChromaDB needs a full re-sync.

## Environment variables

| Variable | Required | Default | Purpose |
|---|---|---|---|
| `ANTHROPIC_API_KEY` | yes | — | Claude API access |
| `S3_BUCKET` | no | — | Source of OpenLineage events and Glue job code |
| `S3_EVENTS_PREFIX` | no | `openlineage/` | S3 prefix for `.ndjson` event files |
| `S3_JOBS_PREFIX` | no | `code/glue/jobs/AEMET/` | S3 prefix for Glue `.py` files |
| `REDIS_URL` | no | `redis://redis:6379/0` | Celery broker |
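
As a rough illustration of how the defaults in this table could be applied at startup, here is a minimal sketch. `load_settings` and `DEFAULTS` are hypothetical names, not code from this repository, and the real application may read these variables differently.

```python
import os
from typing import Mapping, Optional

# Defaults copied from the table above.
DEFAULTS = {
    "S3_EVENTS_PREFIX": "openlineage/",
    "S3_JOBS_PREFIX": "code/glue/jobs/AEMET/",
    "REDIS_URL": "redis://redis:6379/0",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read settings from `env` (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    api_key = env.get("ANTHROPIC_API_KEY", "")
    if not api_key:
        # The only variable the table marks as required.
        raise RuntimeError("ANTHROPIC_API_KEY is required")
    settings = {
        "ANTHROPIC_API_KEY": api_key,
        # No default: without a bucket the sample events are used instead.
        "S3_BUCKET": env.get("S3_BUCKET") or None,
    }
    for name, default in DEFAULTS.items():
        # Treat blank values like unset ones, as a .env file often leaves them empty.
        settings[name] = env.get(name) or default
    return settings
```

Passing a plain dict instead of `os.environ` makes the function easy to check by hand, for example `load_settings({"ANTHROPIC_API_KEY": "sk-ant-..."})`.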
21 changes: 21 additions & 0 deletions LICENSE.md
@@ -0,0 +1,21 @@
# MIT License

Copyright (c) 2026 lucabem

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
120 changes: 120 additions & 0 deletions backend/auth.py
@@ -0,0 +1,120 @@
"""
JWT authentication dependency for FastAPI.

Modes (auto-detected from env vars, in priority order):
  1. RS256/JWKS: JWT_JWKS_URL set (Cognito, Auth0, Okta)
  2. HS256: JWT_SECRET set (dev / internal shared secret)
  3. Dev mode: neither set; returns an anonymous superuser, no HTTP error

JWT claims: sub (required, used as user_id); email (optional, falls back to
username, then to sub).
"""

import logging
import os
import time
from dataclasses import dataclass
from typing import Optional

import requests
from fastapi import HTTPException, Security
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from jose import JWTError, jwk, jwt

logger = logging.getLogger(__name__)

JWT_SECRET = os.getenv("JWT_SECRET", "")
JWT_JWKS_URL = os.getenv("JWT_JWKS_URL", "")
JWT_ISSUER = os.getenv("JWT_ISSUER", "")
JWT_AUDIENCE = os.getenv("JWT_AUDIENCE", "")

_bearer = HTTPBearer(auto_error=False)

# JWKS cache: avoid fetching on every request
_jwks_cache: dict = {"keys": [], "fetched_at": 0.0}
_JWKS_TTL = 3600 # re-fetch after 1 hour


@dataclass
class AuthUser:
    user_id: str
    email: str


def _get_jwks() -> dict:
    now = time.time()
    if now - _jwks_cache["fetched_at"] < _JWKS_TTL and _jwks_cache["keys"]:
        return _jwks_cache
    try:
        resp = requests.get(JWT_JWKS_URL, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        _jwks_cache.update({"keys": data.get("keys", []), "fetched_at": now})
        return _jwks_cache
    except Exception as e:
        logger.error(f"Failed to fetch JWKS from {JWT_JWKS_URL}: {e}")
        if _jwks_cache["keys"]:
            return _jwks_cache  # serve stale on transient errors
        raise HTTPException(status_code=503, detail="Auth service unavailable.")


def _decode_rs256(token: str) -> dict:
    header = jwt.get_unverified_header(token)
    kid = header.get("kid")
    jwks = _get_jwks()
    key_data = next((k for k in jwks["keys"] if k.get("kid") == kid), None)
    if key_data is None:
        # Retry once: the key may have rotated
        _jwks_cache["fetched_at"] = 0.0
        jwks = _get_jwks()
        key_data = next((k for k in jwks["keys"] if k.get("kid") == kid), None)
    if key_data is None:
        raise JWTError(f"No public key found for kid={kid!r}")
    public_key = jwk.construct(key_data)
    options: dict = {"verify_aud": bool(JWT_AUDIENCE)}
    return jwt.decode(
        token,
        public_key.to_dict(),
        algorithms=["RS256"],
        audience=JWT_AUDIENCE or None,
        issuer=JWT_ISSUER or None,
        options=options,
    )


def _decode_hs256(token: str) -> dict:
    options: dict = {"verify_aud": bool(JWT_AUDIENCE)}
    return jwt.decode(
        token,
        JWT_SECRET,
        algorithms=["HS256"],
        audience=JWT_AUDIENCE or None,
        issuer=JWT_ISSUER or None,
        options=options,
    )


def get_current_user(
    credentials: Optional[HTTPAuthorizationCredentials] = Security(_bearer),
) -> AuthUser:
    # Dev mode: no auth configured → anonymous superuser
    if not JWT_SECRET and not JWT_JWKS_URL:
        return AuthUser(user_id="anonymous", email="anonymous@local")

    if credentials is None:
        raise HTTPException(status_code=401, detail="Authorization header required.")

    token = credentials.credentials
    try:
        if JWT_JWKS_URL:
            payload = _decode_rs256(token)
        else:
            payload = _decode_hs256(token)
    except JWTError as e:
        raise HTTPException(status_code=401, detail=f"Invalid token: {e}")

    user_id: Optional[str] = payload.get("sub")
    if not user_id:
        raise HTTPException(status_code=401, detail="Token missing 'sub' claim.")

    email: str = payload.get("email") or payload.get("username") or user_id
    return AuthUser(user_id=user_id, email=email)
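
To exercise the HS256 path (Option A in `.env.example`) by hand, a test token can be minted with the standard library alone. The sketch below is not part of this PR: `make_hs256_token` and its parameters are hypothetical names, and it only sets the claims that `get_current_user` reads (`sub`, `email`) plus `iat` and `exp`.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_hs256_token(secret: str, sub: str, email: str = "", ttl_seconds: int = 3600) -> str:
    """Build a compact JWT signed with HMAC-SHA256 (hypothetical test helper)."""
    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    claims = {"sub": sub, "iat": now, "exp": now + ttl_seconds}
    if email:
        claims["email"] = email
    # The signing input is base64url(header) + "." + base64url(claims).
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

With `JWT_SECRET` set to the same secret and `JWT_ISSUER` / `JWT_AUDIENCE` left blank, sending the result as `Authorization: Bearer <token>` should pass `_decode_hs256`, since issuer and audience checks are skipped when those variables are empty.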