AlcoaBase (ALC) 🧬

AlcoaBase (ALC) is a 100% local, open-source Document & Knowledge Management System. It combines ALCOA+ data integrity with AI, featuring deterministic PDF protocol generation, automated processing of the corresponding report data, training-gated workflows, RAG, and automated Computer System Validation.

Designed specifically for highly regulated environments (e.g., Pharma, Biotech, Manufacturing), AlcoaBase provides a completely air-gapped solution that bridges the gap between strict compliance and cutting-edge artificial intelligence.

✨ Core Features

🤖 Multi-Agent "Always-On" Auditing: Submit documents for parallel review by N AI auditor agents. A supervisory Master Auditor synthesizes findings into a unified executive summary with consensus detection, contradiction flagging, compliance scoring, and prioritized action items. Includes compliance scorecards, missing link detection, and anomaly monitoring.
🛡️ ALCOA+ Data Integrity & Audit Trail: Cryptographic digital signatures (PAdES), strict versioning, and immutable database audit logs for every action (Who, What, When, Why). Every mutating API request requires an X-Change-Reason header for full traceability.
📄 Deterministic PDF-to-Database Mapping for Protocols and Reports: A visual JSON-driven form builder that creates both React web forms and offline-capable PDFs. A proprietary Dual-UUID concept lets data entered into offline PDFs be extracted deterministically and mapped directly to relational PostgreSQL tables upon upload.
🧠 Local AI & Retrieval-Augmented Generation (RAG): Ask questions against all your documents. Built for high-performance inference (optimized for NVIDIA Blackwell GPUs via vLLM) while maintaining 100% data sovereignty.
⚙️ Dynamic BPMN Workflows & Execution: Admins can visually design individual document lifecycles based on meta tags (Draft → Review → Approved → InTraining → Active) using a drag-and-drop BPMN editor. Users execute transitions directly from the document detail page with mandatory change reasons, gate indicators (signature/training), risk-level warnings, and a full audit history timeline.
🎓 Training-Gated Execution with Comprehension Quiz (RBAC/ABAC): Strict access control ensures users can only execute tasks or create reports for specific Standard Operating Procedures (SOPs) if they possess a valid training record AND have passed a comprehension quiz for that exact document version. Quiz attempts are immutable (append-only) for full ALCOA+ audit compliance.
🧠 AI-Enhanced Training Ecosystem: Transforms static training into an adaptive learning platform. AI generates personalized schedules, training materials (summaries, walkthroughs, presentations), comprehension quizzes with semantic grading, and interactive "Virtual Audit" role-play sessions. Dynamic feedback points users to exact source paragraphs when they answer incorrectly.
📝 AI Document Generator (Template-Based): Generate regulatory documents from registered Master Templates. The system extracts template structure (headings, numbering, placeholders), retrieves relevant knowledge base content, and synthesizes section-by-section content using AI. Includes immutable provenance audit trails, cross-reference extraction, and a mandatory human review workflow before documents enter the active ecosystem.
🔄 AI-Driven Change Impact Analysis: Automatically detects when documents are updated and identifies affected downstream documents and training tasks. Builds a dependency graph from cross-references, database links, and semantic similarity. Performs section-level gap analysis with severity ratings (critical/major/minor) and AI-generated remediation suggestions. Produces immutable audit reports and notifies document owners of required updates.
🔗 AI-Powered Traceability & Gap Discovery: Automatically generates Traceability Matrices by crawling requirements documents (URS) and mapping them to test cases (IQ/OQ/PQ/MVP) using three-pass matching (exact ID, cross-reference, semantic similarity). Detects orphan requirements and test cases, computes coverage metrics and compliance readiness scores, and integrates with Change Impact Analysis to flag stale links when requirements change.
✅ Automated Computer System Validation (CSV): A built-in, isolated testing environment. On command, a dedicated Playwright container performs End-to-End (E2E) UI tests, signs documents, verifies database states, and generates a tamper-proof Validation Certificate for FDA/EMA audits.
🔐 Granular Role-Based Access Control (RBAC): Five specialized roles (System Admin, Document Admin, IT Admin, Member, Viewer) with resource-action permission matrices. Permission templates define document-type-specific access rules using a "most restrictive wins" policy. Full user lifecycle management with audit-compliant deactivation, password reset, and company membership management.
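
The X-Change-Reason rule from the audit trail feature can be illustrated with a minimal sketch. It is not AlcoaBase's actual middleware: the names `audit_request`, `AuditEntry`, and `MissingChangeReason` are hypothetical, and the set of methods treated as mutating is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional

# Assumption: these are the HTTP methods treated as "mutating".
MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


class MissingChangeReason(Exception):
    """Raised when a mutating request has no usable X-Change-Reason header."""


@dataclass(frozen=True)
class AuditEntry:
    who: str        # authenticated user
    what: str       # HTTP method and path
    when: datetime  # UTC timestamp of the action
    why: str        # value of the X-Change-Reason header


def audit_request(user: str, method: str, path: str,
                  headers: Mapping[str, str]) -> Optional[AuditEntry]:
    """Return an audit entry for mutating requests, None for read-only ones."""
    method = method.upper()
    if method not in MUTATING_METHODS:
        return None
    # HTTP header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    reason = normalised.get("x-change-reason", "").strip()
    if not reason:
        raise MissingChangeReason(f"{method} {path} requires an X-Change-Reason header")
    return AuditEntry(who=user, what=f"{method} {path}",
                      when=datetime.now(timezone.utc), why=reason)
```

A real deployment would presumably enforce this in request middleware and write the entry to an append-only table; the sketch only shows the Who/What/When/Why shape of the record.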

🏗️ Architecture & Tech Stack

AlcoaBase cleanly separates structured, compliance-critical data from unstructured, semantic AI operations.

Frontend:

React (Vite) + TypeScript
Tailwind CSS & shadcn/ui
Zustand (State Management)
@hello-pangea/dnd & react-hook-form (Visual Form Builder)

Backend:

Python (FastAPI)
SQLAlchemy 2.0 + SQLAlchemy-Continuum (Automated Audit Tables)
SpiffWorkflow (BPMN Workflow Engine)
Celery + Redis (Background Jobs)
ReportLab & PyMuPDF (PDF Generation & UUID Extraction)
watchfiles (Agent YAML hot-reload)

Data & AI Layer:

PostgreSQL (Source of truth, GLP records, User roles)
MinIO (S3-compatible object storage for physical PDFs)
OpenSearch (Vector database for RAG and hybrid lexical/semantic search)
vLLM (Local LLM inference — hybrid architecture with dedicated embedding instance + shared chat/OCR instance)
httpx (Async HTTP client with connection pooling for vLLM communication)

Validation:

Playwright (Automated E2E testing for the CSV module)

📊 Implementation Status

| Phase | Feature | Status |
| --- | --- | --- |
| 1.1 | Multi-Tenancy / Company Separation | ✅ Complete |
| 1.2 | Setup Wizard | ✅ Complete |
| 1.3 | Authentication & Session Management | ✅ Complete |
| 2.1 | Document Upload & List | ✅ Complete |
| 2.2 | Virtual Folders | ✅ Complete |
| 2.3 | Document Versioning UI | ✅ Complete |
| 2.4 | Template Builder (Drag & Drop) | ✅ Complete |
| 2.5 | Report Data Entry & PDF Extraction | ✅ Complete |
| 3.1 | BPMN Workflow Visual Editor | ✅ Complete |
| 3.2 | Workflow Execution & State Transitions | ✅ Complete |
| 3.3 | Training Management UI | ✅ Complete |
| 3.4 | Electronic Signatures UI | ✅ Complete |
| 3.5 | Comprehension Quiz & Dual Gate | ✅ Complete |
| 4.1 | Search UI Integration | ✅ Complete |
| 4.2 | RAG Knowledge Base UI | ✅ Complete |
| 4.3 | AI Model Integration (vLLM) | ✅ Complete |
| 4.4 | Multimodal Knowledge Base | ✅ Complete |
| 5.1 | Modular Agent Registry & Personality Framework | ✅ Complete |
| 5.2 | Multi-Agent "Always-On" Auditing | ✅ Complete |
| 5.3 | AI-Enhanced Training Ecosystem | ✅ Complete |
| 5.4 | AI Document Generator (Template-Based) | ✅ Complete |
| 5.5 | AI-Driven Change Impact Analysis | ✅ Complete |
| 5.6 | AI-Powered Traceability & Gap Discovery | ✅ Complete |
| 6.1 | Admin Dashboard — User Management | ✅ Complete |

See the full roadmap in todo.md.

📖 Documentation

User guides for each major feature are available in the docs/ directory:

| Guide | Description |
| --- | --- |
| Setup Wizard | First-run initialization: admin account, company, AI mode |
| Document Upload | Uploading documents via web UI and bulk CLI tool |
| Workflow Editor | Designing BPMN document lifecycle workflows |
| Workflow Execution | Executing state transitions, gate indicators, and audit history |
| Training Management | Training tasks, content viewer, records, gate enforcement, admin view |
| Comprehension Quiz | Quiz taking, scoring, pass/fail, dual gate verification, audit trail |
| Electronic Signatures | PAdES signing, re-authentication, verification, certificate configuration |
| AI Inference | vLLM integration: RAG queries, embeddings, OCR, health monitoring, mock mode |
| Search & Knowledge Base | Hybrid search, RAG knowledge chat, document indexing, visual content |
| Agent Management | Agent registry, archetypes, personality profiles, tuning parameters, hot-reload |
| Multi-Agent Auditing | Parallel review pipeline, audit profiles, compliance scorecards, anomaly detection |
| AI Training Ecosystem | AI planner, material generation, question generation, virtual audits, dynamic feedback |
| AI Document Generator | Template-based document generation, provenance audit trail, cross-references, review workflow |
| Change Impact Analysis | Dependency graph, automatic impact detection, gap analysis, notifications, training task resets |
| Traceability & Gap Discovery | Automated traceability matrices, three-pass matching, orphan detection, coverage metrics, stale link alerts |
| Admin Dashboard — User Management | RBAC roles, user CRUD, permission templates, company memberships, audit-compliant lifecycle management |

🚀 Getting Started

AlcoaBase is designed to be deployed quickly via Docker Compose.

Prerequisites

Docker & Docker Compose
uv — used for all Python dependency management and CLI tooling
(Optional but recommended) NVIDIA GPU with container toolkit installed for AI features. A CPU-mock mode is available for local testing.

Installation

Clone the repository:

git clone https://github.com/your-org/alcoabase.git
cd alcoabase

Start the environment:
```
docker compose up -d
```
Run the Setup Wizard:

Open your browser and navigate to http://localhost:3000. You will be greeted by the AlcoaBase Setup Wizard. Follow the steps to create your root administrator account, configure your AI hardware settings, and (optionally) seed the database with demo users, BPMN workflows, and SOPs.

🧪 Testing

AlcoaBase has three test layers. All use uv run pytest from the src/backend/ directory.

Unit & Property Tests

Fast, no external dependencies. Runs against in-memory SQLite.

cd src/backend
uv run pytest --tb=short -q

Property-based tests use Hypothesis to validate correctness invariants (tenant isolation, membership rules, migration backfill, quiz scoring, training gate dual verification, etc.).
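
As an illustration of the kind of invariant meant by "training gate dual verification", here is a minimal sketch. It enumerates a small domain exhaustively with the standard library rather than using Hypothesis, and `GateState`, `may_execute`, and `check_dual_gate_invariant` are hypothetical names, not the project's real test code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class GateState:
    trained_version: Optional[int]      # SOP version covered by the training record
    passed_quiz_version: Optional[int]  # SOP version of the passed quiz


def may_execute(state: GateState, sop_version: int) -> bool:
    """Both gates must hold for the exact document version."""
    return (state.trained_version == sop_version
            and state.passed_quiz_version == sop_version)


def check_dual_gate_invariant() -> int:
    """Check the invariant over every combination; return the number of cases."""
    versions = [None, 1, 2]
    checked = 0
    for trained, quizzed, sop_version in product(versions, versions, [1, 2]):
        allowed = may_execute(GateState(trained, quizzed), sop_version)
        if trained != sop_version or quizzed != sop_version:
            # A missing or outdated record on either gate must deny access.
            assert not allowed
        else:
            assert allowed
        checked += 1
    return checked
```

With Hypothesis, the three loop variables would typically become generated arguments of a single test function; the invariant itself stays the same.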

Integration Tests

Tests the full FastAPI request lifecycle (middleware → dependency → route → DB) using an async in-memory SQLite database. No Docker required.

cd src/backend
uv run pytest tests/integration/ -v

Smoke Tests (against Docker Compose)

Hits the real backend running on PostgreSQL to validate migrations, constraint behavior, and end-to-end flows.

# 1. Start the stack
docker compose up -d

# 2. Wait for healthy backend, then run the migration
docker compose exec backend alembic upgrade head

# 3. Run smoke tests
cd src/backend
uv run pytest tests/smoke/ -v

# Optional: point at a different host
SMOKE_TEST_BASE_URL=http://your-host:8080 uv run pytest tests/smoke/ -v

Smoke tests generate unique slugs per run, so they're safe to execute repeatedly without cleanup.
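
A minimal sketch of how a per-run unique slug could be produced is shown below. The helper name `unique_slug` and the `smoke` prefix are assumptions for illustration; the real tests may build their slugs differently.

```python
import uuid


def unique_slug(prefix: str = "smoke") -> str:
    """Return a slug made of the prefix and 12 random hex characters."""
    # uuid4 is random, so repeated runs do not collide with earlier test data.
    return f"{prefix}-{uuid.uuid4().hex[:12]}"
```

Because each run creates records under a fresh slug, earlier runs never need to be cleaned up before the next one starts.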

✅ Computer System Validation (CSV)

To prove to auditors that your local instance of AlcoaBase functions exactly as specified, you can trigger the automated CSV process.

Navigate to the Admin Dashboard -> Validation.
Click "Run Full System Validation".
A dedicated, isolated testing user will automatically run through complete lifecycles (creation, approval, PDF generation, signing, and data extraction).
Upon completion, a digitally signed CSV Report (PDF) will be deposited directly into your AlcoaBase document repository.

🤝 Contributing

We welcome contributions! Whether it's improving the AI prompts, adding new PDF field types, or enhancing the BPMN engine, please check out our CONTRIBUTING.md for guidelines.

🔒 Air-Gapped Deployment

AlcoaBase is designed for fully air-gapped environments where no internet connectivity is available. All AI inference runs locally using pre-downloaded model weights.

Model Download (Internet-Connected Machine)

Before deploying to an air-gapped environment, download the required model weights on a machine with internet access.

First, sync the project dependencies (this installs huggingface-cli via the huggingface-hub package):

# From the project root
cd src/backend
uv sync

Then download models into the project-root models/ directory. AlcoaBase provides two model profiles — pick the one that matches your hardware.

Note: Some models (e.g., Llama) require accepting a license on huggingface.co and authenticating first:
uv run --project src/backend huggingface-cli login

Small Profile (default — fits on a single 24 GB GPU)

Best for development, testing, and smaller deployments. These are the defaults in .env.example.

| Role | Model | Active Params | Download Size |
| --- | --- | --- | --- |
| Chat / Generation | `Qwen/Qwen3.6-35B-A3B` (MoE) | ~3B | ~8 GB |
| Embedding | `Qwen/Qwen3-Embedding-0.6B` | 0.6B | ~1.2 GB |
| Vision / OCR | `google/gemma-4-E4B-it` | ~4B | ~8 GB |

cd ../..
mkdir -p models

# Chat: Qwen3.6 35B MoE (only ~3B active params per token)
uv run --project src/backend hf download Qwen/Qwen3.6-35B-A3B --local-dir models/qwen3.6-35b-a3b

# Embedding: Qwen3-Embedding 0.6B (1024-dim output)
uv run --project src/backend hf download Qwen/Qwen3-Embedding-0.6B --local-dir models/qwen3-embedding-0.6b

# Vision/OCR: Gemma 4 E4B (native vision + OCR)
uv run --project src/backend hf download google/gemma-4-E4B-it --local-dir models/gemma-4-e4b-it

Large Profile (production — requires ≥80 GB VRAM)

For production deployments on high-end hardware (e.g., NVIDIA A100/H100/Blackwell).

| Role | Model | Params | Download Size |
| --- | --- | --- | --- |
| Chat / Generation | `meta-llama/Llama-3.3-70B-Instruct` | 70B | ~140 GB |
| Embedding | `Qwen/Qwen3-Embedding-8B` | 8B | ~16 GB |
| Vision / OCR | `Qwen/Qwen2.5-VL-72B-Instruct` | 72B | ~145 GB |

cd ../..
mkdir -p models

# Chat: Llama 3.3 70B Instruct (requires license acceptance on HuggingFace)
uv run --project src/backend hf download meta-llama/Llama-3.3-70B-Instruct --local-dir models/llama-3.3-70b-instruct

# Embedding: Qwen3-Embedding 8B (#1 on MTEB multilingual leaderboard, 1024-dim)
uv run --project src/backend hf download Qwen/Qwen3-Embedding-8B --local-dir models/qwen3-embedding-8b

# Vision/OCR: Qwen2.5-VL 72B
uv run --project src/backend hf download Qwen/Qwen2.5-VL-72B-Instruct --local-dir models/qwen2.5-vl-72b-instruct

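Switching profiles: Update the MODEL_* variables in your .env to point to the downloaded weights. Both profiles are configured with the same embedding dimension (1024), so the OpenSearch index mapping does not need to change when upgrading. Vectors produced by different embedding models are not comparable, however, so documents indexed under the previous model should be re-embedded after a switch.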

Transfer to Air-Gapped Environment

Transfer the models/ directory to the target machine via approved media (USB drive, internal network share, etc.):

# Example: copy to target machine
rsync -avP models/ target-machine:/path/to/alcoabase/models/

Configuration

Configure the model paths in your .env file. The defaults match the Small Profile:

# Model Manager Mode: gpu (production), cpu (fallback), mock (development)
MODEL_MANAGER_MODE=gpu

# ── Chat / Generation LLM ──
# SMALL (default):
MODEL_CHAT_NAME=Qwen/Qwen3.6-35B-A3B
MODEL_CHAT_PATH=/models/qwen3.6-35b-a3b
MODEL_CHAT_MAX_GPU_MEMORY_GB=24
# LARGE (uncomment to upgrade):
# MODEL_CHAT_NAME=meta-llama/Llama-3.3-70B-Instruct
# MODEL_CHAT_PATH=/models/llama-3.3-70b-instruct
# MODEL_CHAT_MAX_GPU_MEMORY_GB=60

# ── Multilingual Embedding Model ──
# SMALL (default):
MODEL_EMBEDDING_NAME=Qwen/Qwen3-Embedding-0.6B
MODEL_EMBEDDING_PATH=/models/qwen3-embedding-0.6b
MODEL_EMBEDDING_DIMENSION=1024
# LARGE (uncomment to upgrade):
# MODEL_EMBEDDING_NAME=Qwen/Qwen3-Embedding-8B
# MODEL_EMBEDDING_PATH=/models/qwen3-embedding-8b
# MODEL_EMBEDDING_DIMENSION=1024

# ── Vision / OCR Model ──
# SMALL (default):
MODEL_OCR_NAME=google/gemma-4-E4B-it
MODEL_OCR_PATH=/models/gemma-4-e4b-it
# LARGE (uncomment to upgrade):
# MODEL_OCR_NAME=Qwen/Qwen2.5-VL-72B-Instruct
# MODEL_OCR_PATH=/models/qwen2.5-vl-72b-instruct

# GPU Configuration
GPU_DEVICE_ID=0

Docker Compose Volume Mount

The docker-compose.yml mounts the models directory into the vLLM container (configured via MODEL_WEIGHTS_PATH in .env, defaults to ./models):

services:
  vllm:
    volumes:
      - ${MODEL_WEIGHTS_PATH:-./models}:/models:ro
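
The `${MODEL_WEIGHTS_PATH:-./models}` expression falls back to ./models when the variable is unset or empty, and the `:ro` suffix mounts the directory read-only. A minimal Python sketch of the same fallback rule follows; `resolve_model_weights_path` is a hypothetical helper, shown only to make the behavior concrete.

```python
import os
from typing import Mapping, Optional


def resolve_model_weights_path(environ: Optional[Mapping[str, str]] = None) -> str:
    """Mimic ${MODEL_WEIGHTS_PATH:-./models}: fall back when unset or empty."""
    env = os.environ if environ is None else environ
    # `or` covers both the missing key (None) and the empty string.
    return env.get("MODEL_WEIGHTS_PATH") or "./models"
```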

Network Isolation

AI containers are configured with no outbound internet access. The Docker Compose network configuration ensures:

The vLLM container has no external network access
All inter-service communication uses the internal Docker network
No document content, embeddings, or queries leave the deployment

Development Without GPU (Mock Mode)

For local development and testing without GPU hardware:

MODEL_MANAGER_MODE=mock

Mock mode returns:

Random vectors of the correct embedding dimension (1024) for embedding requests
Placeholder text responses for LLM completion requests
Simulated OCR text for scanned PDF processing

This allows full application testing without GPU hardware or downloaded model weights.
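
A minimal sketch of what a mock embedding call could look like, assuming a hypothetical helper named `mock_embedding`; the value range of the random numbers is also an assumption, since only the dimension is specified above.

```python
import random

EMBEDDING_DIMENSION = 1024  # matches MODEL_EMBEDDING_DIMENSION


def mock_embedding(text: str, dimension: int = EMBEDDING_DIMENSION) -> list:
    """Return a random vector of the right size; the input text is ignored."""
    # The values carry no semantic meaning; only the shape matches a real model.
    return [random.uniform(-1.0, 1.0) for _ in range(dimension)]
```

Since the vectors are random, search results in mock mode exercise the pipeline but say nothing about retrieval quality.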

AI Disclosure

This project was developed with assistance from AI coding tools, including Kiro, opencode, Qwen, and Claude. All outputs were reviewed, tested, and accepted by the maintainers. AI was used to support development; all architectural decisions and responsibility remain with the authors.

Disclaimer

This software is provided as is, without warranty of any kind. The authors and contributors are not responsible for any damages, losses, or issues arising from its use, including design errors, hardware damage, manufacturing mistakes, or data loss. AI-generated suggestions are not a replacement for qualified engineering review, and any safety-critical use requires independent expert validation.

📄 License

This project is licensed under the Apache 2.0 License.
