AlcoaBase (ALC) is a 100% local, open-source Document & Knowledge Management System. It combines ALCOA+ data integrity with AI, featuring deterministic PDF protocol generation with automated processing of the corresponding report data, training-gated workflows, retrieval-augmented generation (RAG), and automated Computer System Validation (CSV).
Designed for highly regulated environments (e.g., Pharma, Biotech, Manufacturing), AlcoaBase runs fully air-gapped, so strict compliance requirements and local AI features can coexist in one system.
- Multi-Agent "Always-On" Auditing: Submit documents for parallel review by N AI auditor agents. A supervisory Master Auditor synthesizes findings into a unified executive summary with consensus detection, contradiction flagging, compliance scoring, and prioritized action items. Includes compliance scorecards, missing link detection, and anomaly monitoring.
- ALCOA+ Data Integrity & Audit Trail: Cryptographic digital signatures (PAdES), strict versioning, and immutable database audit logs for every action (Who, What, When, Why). Every mutating API request requires an `X-Change-Reason` header for full traceability.
- Deterministic PDF-to-Database Mapping for Protocols and Reports: A visual JSON-driven form builder that creates both React web forms and offline-capable PDFs. Using a proprietary Dual-UUID concept, data entered into offline PDFs is flawlessly extracted and mapped directly to relational PostgreSQL tables upon upload.
- Local AI & Retrieval-Augmented Generation (RAG): Ask questions against all your documents. Built for high-performance inference (optimized for NVIDIA Blackwell GPUs via vLLM) while maintaining 100% data sovereignty.
- Dynamic BPMN Workflows & Execution: Admins can visually design individual document lifecycles based on meta tags (Draft → Review → Approved → InTraining → Active) using a drag-and-drop BPMN editor. Users execute transitions directly from the document detail page with mandatory change reasons, gate indicators (signature/training), risk-level warnings, and a full audit history timeline.
- Training-Gated Execution with Comprehension Quiz (RBAC/ABAC): Strict access control ensures users can only execute tasks or create reports for specific Standard Operating Procedures (SOPs) if they possess a valid training record AND have passed a comprehension quiz for that exact document version. Quiz attempts are immutable (append-only) for full ALCOA+ audit compliance.
- AI-Enhanced Training Ecosystem: Transforms static training into an adaptive learning platform. AI generates personalized schedules, training materials (summaries, walkthroughs, presentations), comprehension quizzes with semantic grading, and interactive "Virtual Audit" role-play sessions. Dynamic feedback points users to exact source paragraphs when they answer incorrectly.
- AI Document Generator (Template-Based): Generate regulatory documents from registered Master Templates. The system extracts template structure (headings, numbering, placeholders), retrieves relevant knowledge base content, and synthesizes section-by-section content using AI. Includes immutable provenance audit trails, cross-reference extraction, and a mandatory human review workflow before documents enter the active ecosystem.
- AI-Driven Change Impact Analysis: Automatically detects when documents are updated and identifies affected downstream documents and training tasks. Builds a dependency graph from cross-references, database links, and semantic similarity. Performs section-level gap analysis with severity ratings (critical/major/minor) and AI-generated remediation suggestions. Produces immutable audit reports and notifies document owners of required updates.
- AI-Powered Traceability & Gap Discovery: Automatically generates Traceability Matrices by crawling requirements documents (URS) and mapping them to test cases (IQ/OQ/PQ/MVP) using three-pass matching (exact ID, cross-reference, semantic similarity). Detects orphan requirements and test cases, computes coverage metrics and compliance readiness scores, and integrates with Change Impact Analysis to flag stale links when requirements change.
- Automated Computer System Validation (CSV): A built-in, isolated testing environment. On command, a dedicated Playwright container performs End-to-End (E2E) UI tests, signs documents, verifies database states, and generates a tamper-proof Validation Certificate for FDA/EMA audits.
- Granular Role-Based Access Control (RBAC): Five specialized roles (System Admin, Document Admin, IT Admin, Member, Viewer) with resource-action permission matrices. Permission templates define document-type-specific access rules using a "most restrictive wins" policy. Full user lifecycle management with audit-compliant deactivation, password reset, and company membership management.
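The "most restrictive wins" policy for permission templates can be illustrated with a minimal sketch. The `Access` levels and the `resolve_access` function are hypothetical names chosen for illustration, not AlcoaBase's actual API. The sketch assumes that each applicable template yields one access level and that the effective level is the lowest of them.

```python
from enum import IntEnum


class Access(IntEnum):
    """Hypothetical access levels, ordered from most to least restrictive."""
    NONE = 0
    READ = 1
    WRITE = 2
    ADMIN = 3


def resolve_access(template_levels: list[Access]) -> Access:
    """Return the effective level when several permission templates apply.

    Assumption: "most restrictive wins" means the lowest level granted by
    any applicable template becomes the effective level. If no template
    applies, access is denied.
    """
    if not template_levels:
        return Access.NONE
    return min(template_levels)
```

For example, if a role-level template grants `WRITE` but a document-type template allows only `READ`, `resolve_access([Access.WRITE, Access.READ])` returns `Access.READ`.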
AlcoaBase cleanly separates structured, compliance-critical data from unstructured, semantic AI operations.
Frontend:
- React (Vite) + TypeScript
- Tailwind CSS & shadcn/ui
- Zustand (State Management)
- @hello-pangea/dnd & react-hook-form (Visual Form Builder)
Backend:
- Python (FastAPI)
- SQLAlchemy 2.0 + SQLAlchemy-Continuum (Automated Audit Tables)
- SpiffWorkflow (BPMN Workflow Engine)
- Celery + Redis (Background Jobs)
- ReportLab & PyMuPDF (PDF Generation & UUID Extraction)
- watchfiles (Agent YAML hot-reload)
Data & AI Layer:
- PostgreSQL (Source of truth, GLP records, User roles)
- MinIO (S3-compatible object storage for physical PDFs)
- OpenSearch (Vector database for RAG and hybrid lexical/semantic search)
- vLLM (Local LLM inference; hybrid architecture with a dedicated embedding instance and a shared chat/OCR instance)
- httpx (Async HTTP client with connection pooling for vLLM communication)
Validation:
- Playwright (Automated E2E testing for the CSV module)
| Phase | Feature | Status |
|---|---|---|
| 1.1 | Multi-Tenancy / Company Separation | Complete |
| 1.2 | Setup Wizard | Complete |
| 1.3 | Authentication & Session Management | Complete |
| 2.1 | Document Upload & List | Complete |
| 2.2 | Virtual Folders | Complete |
| 2.3 | Document Versioning UI | Complete |
| 2.4 | Template Builder (Drag & Drop) | Complete |
| 2.5 | Report Data Entry & PDF Extraction | Complete |
| 3.1 | BPMN Workflow Visual Editor | Complete |
| 3.2 | Workflow Execution & State Transitions | Complete |
| 3.3 | Training Management UI | Complete |
| 3.4 | Electronic Signatures UI | Complete |
| 3.5 | Comprehension Quiz & Dual Gate | Complete |
| 4.1 | Search UI Integration | Complete |
| 4.2 | RAG Knowledge Base UI | Complete |
| 4.3 | AI Model Integration (vLLM) | Complete |
| 4.4 | Multimodal Knowledge Base | Complete |
| 5.1 | Modular Agent Registry & Personality Framework | Complete |
| 5.2 | Multi-Agent "Always-On" Auditing | Complete |
| 5.3 | AI-Enhanced Training Ecosystem | Complete |
| 5.4 | AI Document Generator (Template-Based) | Complete |
| 5.5 | AI-Driven Change Impact Analysis | Complete |
| 5.6 | AI-Powered Traceability & Gap Discovery | Complete |
| 6.1 | Admin Dashboard: User Management | Complete |
See the full roadmap in todo.md.
User guides for each major feature are available in the docs/ directory:
| Guide | Description |
|---|---|
| Setup Wizard | First-run initialization: admin account, company, AI mode |
| Document Upload | Uploading documents via web UI and bulk CLI tool |
| Workflow Editor | Designing BPMN document lifecycle workflows |
| Workflow Execution | Executing state transitions, gate indicators, and audit history |
| Training Management | Training tasks, content viewer, records, gate enforcement, admin view |
| Comprehension Quiz | Quiz taking, scoring, pass/fail, dual gate verification, audit trail |
| Electronic Signatures | PAdES signing, re-authentication, verification, certificate configuration |
| AI Inference | vLLM integration: RAG queries, embeddings, OCR, health monitoring, mock mode |
| Search & Knowledge Base | Hybrid search, RAG knowledge chat, document indexing, visual content |
| Agent Management | Agent registry, archetypes, personality profiles, tuning parameters, hot-reload |
| Multi-Agent Auditing | Parallel review pipeline, audit profiles, compliance scorecards, anomaly detection |
| AI Training Ecosystem | AI planner, material generation, question generation, virtual audits, dynamic feedback |
| AI Document Generator | Template-based document generation, provenance audit trail, cross-references, review workflow |
| Change Impact Analysis | Dependency graph, automatic impact detection, gap analysis, notifications, training task resets |
| Traceability & Gap Discovery | Automated traceability matrices, three-pass matching, orphan detection, coverage metrics, stale link alerts |
| Admin Dashboard: User Management | RBAC roles, user CRUD, permission templates, company memberships, audit-compliant lifecycle management |
AlcoaBase is designed to be deployed quickly via Docker Compose.
- Docker & Docker Compose
- uv: used for all Python dependency management and CLI tooling
- (Optional but recommended) NVIDIA GPU with container toolkit installed for AI features. A CPU-mock mode is available for local testing.
- Clone the repository:
  git clone https://github.com/your-org/alcoabase.git
  cd alcoabase
- Start the environment:
  docker compose up -d
- Run the Setup Wizard:
  Open your browser and navigate to http://localhost:3000. You will be greeted by the AlcoaBase Setup Wizard. Follow the steps to create your root administrator account, configure your AI hardware settings, and (optionally) seed the database with demo users, BPMN workflows, and SOPs.
AlcoaBase has three test layers. All use uv run pytest from the src/backend/ directory.
Fast, no external dependencies. Runs against in-memory SQLite.
cd src/backend
uv run pytest --tb=short -q
Property-based tests use Hypothesis to validate correctness invariants (tenant isolation, membership rules, migration backfill, quiz scoring, training gate dual verification, etc.).
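As an illustration of the kind of invariant listed above, here is a minimal sketch of a check for the training gate's dual verification. To stay dependency-free it enumerates a small input space with `itertools` instead of using Hypothesis. The `can_execute` function and its parameters are hypothetical and are not taken from the AlcoaBase code base.

```python
from itertools import product
from typing import Optional


def can_execute(has_valid_training: bool,
                passed_quiz_version: Optional[str],
                document_version: str) -> bool:
    """Hypothetical dual gate: training AND a passed quiz for this exact version."""
    return has_valid_training and passed_quiz_version == document_version


def check_dual_gate_invariant() -> int:
    """Enumerate all input combinations and assert the gate never opens early.

    Returns the number of combinations checked.
    """
    versions = ["1.0", "2.0"]
    checked = 0
    for trained, quiz_version, doc_version in product(
        [True, False], [None, *versions], versions
    ):
        allowed = can_execute(trained, quiz_version, doc_version)
        if allowed:
            # Opening the gate requires both conditions at once.
            assert trained and quiz_version == doc_version
        else:
            # A closed gate means at least one condition failed.
            assert not trained or quiz_version != doc_version
        checked += 1
    return checked
```

A property-based version would replace the explicit `product(...)` loop with generated inputs while keeping the same two assertions.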
Tests the full FastAPI request lifecycle (middleware → dependency → route → DB) using an async in-memory SQLite database. No Docker required.
cd src/backend
uv run pytest tests/integration/ -v
Hits the real backend running on PostgreSQL to validate migrations, constraint behavior, and end-to-end flows.
# 1. Start the stack
docker compose up -d
# 2. Wait for healthy backend, then run the migration
docker compose exec backend alembic upgrade head
# 3. Run smoke tests
cd src/backend
uv run pytest tests/smoke/ -v
# Optional: point at a different host
SMOKE_TEST_BASE_URL=http://your-host:8080 uv run pytest tests/smoke/ -v
Smoke tests generate unique slugs per run, so they're safe to execute repeatedly without cleanup.
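One common way to get per-run unique slugs is to append a random suffix. The sketch below shows the idea; `unique_slug` is a hypothetical helper, and the actual smoke tests may generate their slugs differently.

```python
import uuid


def unique_slug(prefix: str = "smoke") -> str:
    """Build a slug that differs on every call by appending a UUID4 fragment."""
    # 12 hex characters (48 random bits) keep the slug short while making
    # collisions between test runs extremely unlikely.
    return f"{prefix}-{uuid.uuid4().hex[:12]}"
```

Because each run creates records under a fresh slug, earlier runs never collide with later ones, which is what makes cleanup optional.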
To prove to auditors that your local instance of AlcoaBase functions exactly as specified, you can trigger the automated CSV process.
- Navigate to the Admin Dashboard -> Validation.
- Click "Run Full System Validation".
- A dedicated, isolated testing user will automatically run through complete lifecycles (creation, approval, PDF generation, signing, and data extraction).
- Upon completion, a digitally signed CSV Report (PDF) will be deposited directly into your AlcoaBase document repository.
We welcome contributions! Whether it's improving the AI prompts, adding new PDF field types, or enhancing the BPMN engine, please check out our CONTRIBUTING.md for guidelines.
AlcoaBase is designed for fully air-gapped environments where no internet connectivity is available. All AI inference runs locally using pre-downloaded model weights.
Before deploying to an air-gapped environment, download the required model weights on a machine with internet access.
First, sync the project dependencies (this installs the hf command-line tool via the huggingface-hub package):
# From the project root
cd src/backend
uv sync
Then download models into the project-root models/ directory. AlcoaBase provides two model profiles; pick the one that matches your hardware.
Note: Some models (e.g., Llama) require accepting a license on huggingface.co and authenticating first (run from the project root):
uv run --project src/backend hf auth login
Best for development, testing, and smaller deployments. These are the defaults in .env.example.
| Role | Model | Active Params | Download Size |
|---|---|---|---|
| Chat / Generation | Qwen/Qwen3.6-35B-A3B (MoE) | ~3B | ~8 GB |
| Embedding | Qwen/Qwen3-Embedding-0.6B | 0.6B | ~1.2 GB |
| Vision / OCR | google/gemma-4-E4B-it | ~4B | ~8 GB |
# Return to the project root (skip this if you are already there)
cd ../..
mkdir -p models
# Chat: Qwen3.6 35B MoE (only ~3B active params per token)
uv run --project src/backend hf download Qwen/Qwen3.6-35B-A3B --local-dir models/qwen3.6-35b-a3b
# Embedding: Qwen3-Embedding 0.6B (1024-dim output)
uv run --project src/backend hf download Qwen/Qwen3-Embedding-0.6B --local-dir models/qwen3-embedding-0.6b
# Alternative: Qwen/Qwen3-Embedding-8B (the Large Profile embedding model)
# Vision/OCR: Gemma 4 E4B (native vision + OCR)
uv run --project src/backend hf download google/gemma-4-E4B-it --local-dir models/gemma-4-e4b-it
For production deployments on high-end hardware (e.g., NVIDIA A100/H100/Blackwell).
| Role | Model | Params | Download Size |
|---|---|---|---|
| Chat / Generation | meta-llama/Llama-3.3-70B-Instruct | 70B | ~140 GB |
| Embedding | Qwen/Qwen3-Embedding-8B | 8B | ~16 GB |
| Vision / OCR | Qwen/Qwen2.5-VL-72B-Instruct | 72B | ~145 GB |
# Return to the project root (skip this if you are already there)
cd ../..
mkdir -p models
# Chat: Llama 3.3 70B Instruct (requires license acceptance on HuggingFace)
uv run --project src/backend hf download meta-llama/Llama-3.3-70B-Instruct --local-dir models/llama-3.3-70b-instruct
# Embedding: Qwen3-Embedding 8B (#1 on MTEB multilingual leaderboard, 1024-dim)
uv run --project src/backend hf download Qwen/Qwen3-Embedding-8B --local-dir models/qwen3-embedding-8b
# Vision/OCR: Qwen2.5-VL 72B
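uv run --project src/backend hf download Qwen/Qwen2.5-VL-72B-Instruct --local-dir models/qwen2.5-vl-72b-instruct
Switching profiles: Update the `MODEL_*` variables in your `.env` to point to the downloaded weights. Both profiles are configured with the same embedding dimension (`MODEL_EMBEDDING_DIMENSION=1024`), so the OpenSearch index mapping stays the same when upgrading. Documents indexed with the previous embedding model should still be re-embedded, since vectors from different models are not comparable.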
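Before moving the weights to the target machine, it can help to record checksums so the copy can be verified after the transfer. The following is a minimal sketch using only the Python standard library; `build_manifest` and `verify_manifest` are hypothetical helpers and are not part of AlcoaBase.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large weight files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file under root (by relative POSIX path) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest differs."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

A typical use would be to call `build_manifest(Path("models"))` on the download machine, carry the resulting mapping along with the weights (for example as JSON), and call `verify_manifest` on the target machine; an empty list means every file arrived intact.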
Transfer the models/ directory to the target machine via approved media (USB drive, internal network share, etc.):
# Example: copy to target machine
rsync -avP models/ target-machine:/path/to/alcoabase/models/
Configure the model paths in your .env file. The defaults match the Small Profile:
# Model Manager Mode: gpu (production), cpu (fallback), mock (development)
MODEL_MANAGER_MODE=gpu
# ── Chat / Generation LLM ──
# SMALL (default):
MODEL_CHAT_NAME=Qwen/Qwen3.6-35B-A3B
MODEL_CHAT_PATH=/models/qwen3.6-35b-a3b
MODEL_CHAT_MAX_GPU_MEMORY_GB=24
# LARGE (uncomment to upgrade):
# MODEL_CHAT_NAME=meta-llama/Llama-3.3-70B-Instruct
# MODEL_CHAT_PATH=/models/llama-3.3-70b-instruct
# MODEL_CHAT_MAX_GPU_MEMORY_GB=60
# ── Multilingual Embedding Model ──
# SMALL (default):
MODEL_EMBEDDING_NAME=Qwen/Qwen3-Embedding-0.6B
MODEL_EMBEDDING_PATH=/models/qwen3-embedding-0.6b
MODEL_EMBEDDING_DIMENSION=1024
# LARGE (uncomment to upgrade):
# MODEL_EMBEDDING_NAME=Qwen/Qwen3-Embedding-8B
# MODEL_EMBEDDING_PATH=/models/qwen3-embedding-8b
# MODEL_EMBEDDING_DIMENSION=1024
# ── Vision / OCR Model ──
# SMALL (default):
MODEL_OCR_NAME=google/gemma-4-E4B-it
MODEL_OCR_PATH=/models/gemma-4-e4b-it
# LARGE (uncomment to upgrade):
# MODEL_OCR_NAME=Qwen/Qwen2.5-VL-72B-Instruct
# MODEL_OCR_PATH=/models/qwen2.5-vl-72b-instruct
# GPU Configuration
GPU_DEVICE_ID=0
The docker-compose.yml mounts the models directory into the vLLM container (configured via MODEL_WEIGHTS_PATH in .env, defaults to ./models):
services:
  vllm:
    volumes:
      - ${MODEL_WEIGHTS_PATH:-./models}:/models:ro
AI containers are configured with no outbound internet access. The Docker Compose network configuration ensures:
- The vLLM container has no external network access
- All inter-service communication uses the internal Docker network
- No document content, embeddings, or queries leave the deployment
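One way to spot-check the isolation guarantees above is to try opening an outbound TCP connection from inside a container and confirm that it fails. The sketch below is an assumption about how such a check could look; `has_outbound_access` is a hypothetical helper, and the host you probe is your choice.

```python
import socket


def has_outbound_access(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    DNS failures, refused connections, and timeouts all raise OSError
    (or a subclass) and are reported as "no access".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run inside the vLLM container against any public host on port 443, the function is expected to return `False` in a correctly isolated deployment, while the same call against an internal service name should still succeed.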
For local development and testing without GPU hardware:
MODEL_MANAGER_MODE=mock
Mock mode returns:
- Random vectors of the correct embedding dimension (1024) for embedding requests
- Placeholder text responses for LLM completion requests
- Simulated OCR text for scanned PDF processing
This allows full application testing without GPU hardware or downloaded model weights.
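To make the mock behavior concrete, here is a minimal sketch of what a mock embedding function could look like. `mock_embedding` is a hypothetical name, and seeding the generator with the input text is an assumption made here so that repeated calls are reproducible; the document only states that the vectors are random and have the correct dimension.

```python
import random

EMBEDDING_DIMENSION = 1024  # matches MODEL_EMBEDDING_DIMENSION in .env


def mock_embedding(text: str, dimension: int = EMBEDDING_DIMENSION) -> list[float]:
    """Return a pseudo-random vector of the requested dimension.

    Assumption: the generator is seeded with the input text, so the same
    text always maps to the same vector. This keeps tests repeatable but
    carries no semantic meaning.
    """
    rng = random.Random(text)
    return [rng.uniform(-1.0, 1.0) for _ in range(dimension)]
```

Vectors of this kind let indexing and search code paths run end to end, but similarity scores computed from them are meaningless and should not be used to judge retrieval quality.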
This project was developed with assistance from AI coding tools, including Kiro, OpenCode, Qwen, and Claude. All outputs were reviewed, tested, and accepted by the maintainers. AI was used to support development; all architectural decisions and responsibility remain with the authors.
This software is provided as is, without warranty of any kind. The authors and contributors are not responsible for any damages, losses, or issues arising from its use, including design errors, hardware damage, manufacturing mistakes, or data loss. AI-generated suggestions are not a replacement for qualified engineering review, and any safety-critical use requires independent expert validation.
This project is licensed under the Apache 2.0 License.