Open
138 commits
dc29158
Adjusting more the cpu to fit better on DO hosts
rodrodsilo Feb 16, 2026
375aff7
feat: update docs to current cluster-forge state
brownzebra Feb 17, 2026
ca9d0d9
fix: resources in prd
brownzebra Feb 17, 2026
e703f6b
chore: removing dev mode from docs
brownzebra Feb 18, 2026
ddc7761
fix: bump airm version to 0.3.3
brownzebra Feb 18, 2026
5bbe9fe
chore: bump version in values file
brownzebra Feb 18, 2026
d0ba511
chore: update cmponents for sbom
brownzebra Feb 18, 2026
397e19a
Merge pull request #596 from silogen/EAI-944-resource-manager-helm-ch…
brownzebra Feb 18, 2026
f8e9760
fix: Copy the tls to the argocd namespace and add this as a rootCA to…
pwistbac Feb 18, 2026
7dc274a
feat: add --target-revision flag with git ancestry validation
Q-Dub Feb 17, 2026
b5a0898
fix: set targetRevision within cluster-values
Q-Dub Feb 18, 2026
ded11b6
chore: revert defunct feature relics
Q-Dub Feb 18, 2026
fda6ed4
docs: add one line at bootstrap_guide
woojae-siloai Feb 19, 2026
39003c0
Merge pull request #595 from silogen/EAI-1332-document-cluster-forge
brownzebra Feb 19, 2026
d1869e8
Merge pull request #590 from silogen/feature-decrease-capacity-needed…
brownzebra Feb 19, 2026
ee5db21
Merge pull request #594 from silogen/EAI-1436_feat_target_revision_flag
Q-Dub Feb 19, 2026
a244207
chore: bump airm to 0.3.4
brownzebra Feb 19, 2026
91227c6
Merge pull request #598 from silogen/EAI-944-resource-manager-helm-ch…
brownzebra Feb 19, 2026
fe4d359
bump-airm-version-values-sbom
brownzebra Feb 19, 2026
44416dd
Merge pull request #599 from silogen/bump-airm-version-values-sbom
brownzebra Feb 19, 2026
54c379e
Update version to v1.8.0-rc3 [actions skip]
brownzebra Feb 19, 2026
04810df
Update version to v1.8.0-rc4 [actions skip]
brownzebra Feb 19, 2026
85cef48
chore: bump airm to 0.3.5
brownzebra Feb 20, 2026
e1a9bfd
Merge pull request #602 from silogen/EAI-944-resource-manager-helm-ch…
brownzebra Feb 20, 2026
72088e1
cleanup-workflow-to-not-commit-target-revision
brownzebra Feb 20, 2026
e39c6ec
target-revision-main-update-airm-version
brownzebra Feb 20, 2026
745dc57
Merge pull request #604 from silogen/cleanup-workflow-to-not-commit-t…
brownzebra Feb 20, 2026
75f056d
Merge branch 'main' into target-revision-main-update-airm-version
brownzebra Feb 20, 2026
4490d9c
Merge pull request #605 from silogen/target-revision-main-update-airm…
Q-Dub Feb 20, 2026
56c00b9
simplifty-target-revision-bootstrap
brownzebra Feb 20, 2026
4650d06
Merge pull request #606 from silogen/simplifty-target-revision-bootstrap
brownzebra Feb 20, 2026
fe019cd
feature: Improve liveness condition
rodrodsilo Feb 24, 2026
2ecd906
fix: minor fix
rodrodsilo Feb 24, 2026
098c5a3
fix: correcting script for sh shell
rodrodsilo Feb 25, 2026
e8fe2e6
Merge pull request #609 from silogen/feature-improve-otel-processes-r…
brownzebra Feb 25, 2026
db83b73
fix: remove kyverno local-path policies from kyverno-config and value…
Q-Dub Feb 25, 2026
7bf3f96
put the access mode mutation into templates dir
brownzebra Feb 25, 2026
fa5e8b6
escaping special characters in kyverno template
brownzebra Feb 25, 2026
a4867e8
ignore background requests and esixting violations fields in upgrade …
brownzebra Feb 25, 2026
4519e74
Merge pull request #610 from silogen/hotfix_local_path_policy
Q-Dub Feb 25, 2026
4bb6777
fix(values_small.yaml): was missing base kyverno policies
Q-Dub Feb 26, 2026
5343776
fix: app dependency sequence via waves
Q-Dub Feb 26, 2026
3f478f2
fix: add ignoreDifferences section for keify-otel
Q-Dub Feb 26, 2026
73fe6ef
feat: add custom keda/scaledObject health check at argo app level
Q-Dub Feb 26, 2026
c467140
Merge branch 'main' into EAI-1454_AIM_install_deps
Q-Dub Feb 26, 2026
d25d7aa
fix: remove trailing pipe
Q-Dub Feb 26, 2026
38005c9
Merge pull request #611 from silogen/EAI-1454_AIM_install_deps
Q-Dub Feb 26, 2026
0f08287
Merge pull request #612 from silogen/hotfix_localpath_kyverno
brownzebra Feb 26, 2026
fbf2e1f
feat: add modelcache resource contraint policy (wiring still pending)
Q-Dub Feb 26, 2026
f0364d7
refactor: move enabledApps into large for greater clarity, instead of…
Q-Dub Feb 26, 2026
33bd424
feat: add Chart.yaml and templates folder for new kyverno policy
Q-Dub Feb 26, 2026
cded607
fix: add app definition
Q-Dub Feb 26, 2026
ae88116
fix: new base policy (resource constraints on modelcache)
Q-Dub Feb 26, 2026
1d899e4
fix: rm large relic
Q-Dub Feb 26, 2026
240cdbe
Merge pull request #613 from silogen/EAI-1620_constrain_modelcache_re…
Q-Dub Feb 26, 2026
b10a02a
add cluster-policy to kaiwo
woojae-siloai Mar 2, 2026
ae1441c
move cluster-policy for kaiwo to kyverno-policies
woojae-siloai Mar 2, 2026
c2e10f4
delete cluster-policy from kaiwo
woojae-siloai Mar 2, 2026
83f8e6d
Merge pull request #616 from silogen/add_kaiwo_cluster_policy
Q-Dub Mar 2, 2026
64dd39f
fix: remove duplicate ignoreDifferences (and scope for entire array, …
Q-Dub Mar 2, 2026
844dc6f
refactor: bootsrap without yq
Q-Dub Feb 27, 2026
8d84f71
feat: remove forceful bootstrap of cluster-forge child apps
Q-Dub Feb 27, 2026
72cb3c8
qa: verify new bootstrap by removing component on branch (AIRM)
Q-Dub Feb 27, 2026
a923235
fix: integrate Opoenbao and Gitea in to bootstrap_v2
Q-Dub Feb 27, 2026
7193db6
fix: gitea bootstraping
Q-Dub Feb 27, 2026
e3c1ba5
fix: openbao needed before Gitea
Q-Dub Feb 27, 2026
a983c6d
fix: openbao deployment
Q-Dub Feb 27, 2026
6a051a0
fix: argo crd ordering error
Q-Dub Feb 27, 2026
b70834a
fix: add back AIRM as enabled app after successful test
Q-Dub Feb 27, 2026
e62f54d
feat(bootstrap): only deploy cluster-forge Argo app, children sync vi…
Q-Dub Feb 27, 2026
509500c
feat: rm yq dependency and check; fix: rm partially implemented clean…
Q-Dub Feb 27, 2026
7276dd5
feat: separate values for argo and gitea, so yq can completely be
Q-Dub Feb 27, 2026
ebc4e11
chore: rm bootstrap v2, as ideas have been merged into base bootstrap.sh
Q-Dub Feb 27, 2026
1868d4f
chore: streamline openbao manifest applies
Q-Dub Feb 27, 2026
b43683f
fix: openbao deployment flow
Q-Dub Feb 27, 2026
56bb57a
chore: rm partial cleanup function
Q-Dub Feb 27, 2026
54b7a77
refactor: simplify bootstrap script
Q-Dub Feb 27, 2026
0eac7a8
fix: ordering issue
Q-Dub Feb 27, 2026
95818fa
feat: allow parent cluster-forge app to bootstrap openbao (thereby si…
Q-Dub Feb 28, 2026
91f4a9c
feat(bootstrap.sh): check for dependencies with guidance for resolving
Q-Dub Feb 28, 2026
4a5f5b0
feat: enable child app rendering
Q-Dub Feb 28, 2026
7d3537d
fix: gitea-init-job race condition
Q-Dub Feb 28, 2026
51fb6c0
fix: inject cluster size values file
Q-Dub Feb 28, 2026
b133d92
fix: openbao health check
Q-Dub Mar 1, 2026
3201b05
qa: add xtrace flag for debugging
Q-Dub Mar 1, 2026
d4cd995
fix: gitea-init cm
Q-Dub Mar 1, 2026
456466d
fix: openbao health cheks; sync wave gap restructure
Q-Dub Mar 1, 2026
7a8154a
fix: openbao config timing and dep chain
Q-Dub Mar 1, 2026
b5dd7f9
fix: add openbao-init to enabled apps
Q-Dub Mar 1, 2026
8ed5cc7
feat: update SBOM scripts to collate from all cluster sizes and not d…
Q-Dub Mar 1, 2026
76985cb
feat: improved error messaging, dependency validation, and progress v…
Q-Dub Mar 1, 2026
633632c
fix: openbao cm refs; simplify some redundant bootstrap checks
Q-Dub Mar 1, 2026
71bd3e5
fix: rm sessionAffinity warning for Gitea
Q-Dub Mar 1, 2026
abfdb6f
fix: KC resourcePreset for medium
Q-Dub Mar 1, 2026
d5255ad
fix: remove non-manifest printing when using --apps / --template-only…
Q-Dub Mar 2, 2026
1ec0892
qa: revert init-gitea-job scripts to see if needed with restructuring
Q-Dub Mar 2, 2026
d73d798
qa: rm unneeded debug script
Q-Dub Mar 2, 2026
cbc73c1
refactor: move Gitea sessionAffinity warning from vendor source to va…
Q-Dub Mar 2, 2026
e3a803a
refactor: alpha-sort apps
Q-Dub Mar 2, 2026
9ef87d6
rm values backup file
Q-Dub Mar 2, 2026
4198a2b
fix remove main as hard-coded revision, since bootstrap.sh always clo…
Q-Dub Mar 2, 2026
4a88c30
perf(Keycloak): improve use of heap memory mgmt (solves OOM issues de…
Q-Dub Mar 2, 2026
634ed92
docs: update to reflect restructuring
Q-Dub Mar 2, 2026
8070fe6
fix: don't clobber cluster-values revision
Q-Dub Mar 2, 2026
4bb510f
ux: have gitea-init-job retry with same job so you don't see multiple…
Q-Dub Mar 2, 2026
2341e85
perf: improve gitea init job readiness check
Q-Dub Mar 2, 2026
382079c
ux: rm gitea sessionAffinity warning
Q-Dub Mar 2, 2026
093bb88
perf: tweak minio create user cronjob, as got OOMKilled
Q-Dub Mar 2, 2026
e4b4dbc
fix: remove duplicate ignoreDifferences (and scope for entire array, …
Q-Dub Mar 2, 2026
8fbfa7b
EAI-1238 deprecate scripts/init-openbao-job/values.yaml
brownzebra Mar 3, 2026
9b20d3c
feat: implement AIRM_IMAGE_REPOSITORY env when running bootstrap.sh (…
Q-Dub Mar 2, 2026
45176ab
docs: document AIRM_IMAGE_REPOSITORY feature
Q-Dub Mar 3, 2026
1ad26f8
feat: refactor AIRM_IMAGE_REPOSITORY as command arg as opposed to env…
Q-Dub Mar 3, 2026
466b1eb
fix: revert to dedicated openbao deploy before cluster-forge parent a…
Q-Dub Mar 3, 2026
4812ac5
fix: revert openbao and gitea scripts to match main, preserving --air…
Q-Dub Mar 3, 2026
607e5e2
fix: add proper values files checks
Q-Dub Mar 3, 2026
2e56f4f
qa: add error handling and yq fix
Q-Dub Mar 3, 2026
7f05e92
qa: update check logic for apps.openbao
Q-Dub Mar 3, 2026
b558b91
fix: resolve potential issue with previous run temp file permissions
Q-Dub Mar 3, 2026
82d9064
fix: same temp folder approach as now working openbao
Q-Dub Mar 3, 2026
731b477
fix: airm image seeding for cluster-values
Q-Dub Mar 3, 2026
466d70a
fix: apps key in cluster-values.yaml
Q-Dub Mar 3, 2026
66af8d0
fix: add regcreds for airm images
Q-Dub Mar 3, 2026
56273d4
fix: cp error in airm images
Q-Dub Mar 3, 2026
17ae294
fix: value quotes for aimr images
Q-Dub Mar 3, 2026
a843ff6
fix: add domain when using AIRM_IMAGE_REPOSITORY
Q-Dub Mar 4, 2026
249d550
fix: init-gitea to ref domain without global
Q-Dub Mar 4, 2026
9215c99
fix: adjust syncWaves in light of cluster-auth erroring on restarting…
Q-Dub Mar 4, 2026
b225e21
fix: argo health checks
Q-Dub Mar 4, 2026
e28b767
revert value_ha.yaml (not in use)
Q-Dub Mar 4, 2026
e229500
Merge pull request #618 from silogen/fix_duplicate_key_medium
Q-Dub Mar 4, 2026
7207dc7
Update bootstrap_guide.md
Q-Dub Mar 4, 2026
9d7cca5
Merge branch 'main' into EAI-1235_octopus
brownzebra Mar 4, 2026
d93e501
Merge pull request #620 from silogen/EAI-1235_octopus
Q-Dub Mar 4, 2026
5eeba32
docs: update v1.8.0 changes from recent updates
Q-Dub Mar 4, 2026
30a7d2c
fix(bootstrap.sh): --template-only flag to remove non-template output
Q-Dub Mar 4, 2026
3a5df89
Merge pull request #621 from silogen/docs_update_from_octopus_merge
Q-Dub Mar 4, 2026
b0614c9
Merge pull request #622 from silogen/fix_bootstrap_template_only_flag
Q-Dub Mar 4, 2026
36 changes: 20 additions & 16 deletions .github/workflows/release-pipeline.yaml
@@ -42,27 +42,31 @@ jobs:
fi
echo "next=$VERSION" >> $GITHUB_OUTPUT

- name: Update helm values file
uses: mikefarah/yq@master
- name: Validate LATEST_RELEASE matches release version
env:
GIT_TAG: ${{ steps.semver.outputs.next }}
with:
cmd: |
yq -i '.clusterForge.targetRevision = env(GIT_TAG)' root/values.yaml
yq -i '.targetRevision = env(GIT_TAG)' scripts/init-gitea-job/values.yaml

- name: Commit and push changes
uses: stefanzweifel/git-auto-commit-action@v4
env:
GIT_TAG: ${{ steps.semver.outputs.next }}
with:
commit_message: 'Update version to ${{ env.GIT_TAG }} [actions skip]'
VERSION: ${{ steps.semver.outputs.next }}
run: |
# Extract LATEST_RELEASE from bootstrap.sh
LATEST_RELEASE=$(grep '^LATEST_RELEASE=' scripts/bootstrap.sh | cut -d'"' -f2 | sed 's/^v//')

# Extract base version (before -rc or -alpha, etc.)
RELEASE_BASE=$(echo "$VERSION" | sed 's/^v//' | sed 's/-rc[0-9]*$//' | sed 's/-alpha[0-9]*$//' | sed 's/-beta[0-9]*$//')
LATEST_BASE=$(echo "$LATEST_RELEASE" | sed 's/-rc[0-9]*$//' | sed 's/-alpha[0-9]*$//' | sed 's/-beta[0-9]*$//')

echo "Release version: $VERSION (base: $RELEASE_BASE)"
echo "LATEST_RELEASE in bootstrap.sh: $LATEST_RELEASE (base: $LATEST_BASE)"

if [[ "$RELEASE_BASE" != "$LATEST_BASE" ]]; then
echo "::warning::LATEST_RELEASE base version ($LATEST_BASE) in scripts/bootstrap.sh does not match release version base ($RELEASE_BASE)"
echo "::warning::Consider updating LATEST_RELEASE in scripts/bootstrap.sh to match the release being created"
else
echo "✓ LATEST_RELEASE base version matches release version base"
fi

- name: Create GitHub Release
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
VERSION: ${{ steps.semver.outputs.next }}
EXTRA_ARGS: ${{ steps.version.outputs.extra_args }}
run: |
# Prepare release artifact
tar -zcvf "release-enterprise-ai-${VERSION}.tar.gz" --transform 's,^,cluster-forge/,' root/ scripts/ sources
@@ -134,4 +138,4 @@ jobs:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
SBOM_NAME: ${{ steps.generate_sbom.outputs.sbom_name }}
run: |
gh release upload ${VERSION} ${SBOM_NAME} --clobber
gh release upload ${VERSION} ${SBOM_NAME} --clobber
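
The validation step above compares the release version with `LATEST_RELEASE` after dropping the leading `v` and any pre-release suffix. A minimal sketch of that comparison follows. The helper names `base_version` and `same_base` are hypothetical and do not appear in the workflow. It also condenses the chained `sed` calls into one expression, so it strips a single trailing `-rcN`, `-alphaN`, or `-betaN` suffix rather than several stacked ones.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): drop a leading "v" and one
# trailing -rcN / -alphaN / -betaN suffix to get the base version.
base_version() {
  printf '%s\n' "$1" | sed -E 's/^v//; s/-(rc|alpha|beta)[0-9]*$//'
}

# Hypothetical helper: succeeds (exit 0) when both versions share a base.
same_base() {
  [ "$(base_version "$1")" = "$(base_version "$2")" ]
}

# Example: a release candidate compared against the final version string.
if same_base "v1.8.0-rc4" "v1.8.0"; then
  echo "match"
else
  echo "mismatch"
fi
```

As in the workflow step, a mismatch here would only warrant a warning, since the step emits `::warning::` annotations rather than failing the job.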
732 changes: 609 additions & 123 deletions PRD.md

Large diffs are not rendered by default.

209 changes: 125 additions & 84 deletions README.md
@@ -1,130 +1,169 @@
# Cluster-Forge

**A helper tool that deploys [AMD Enterprise AI Suite](https://enterprise-ai.docs.amd.com/en/latest/) into Kubernetes cluster.**
**A Kubernetes platform automation tool that deploys [AMD Enterprise AI Suite](https://enterprise-ai.docs.amd.com/en/latest/) with complete GitOps infrastructure.**

## Overview

**Cluster-Forge** is a tool designed to bundle various third-party, community, and in-house components into a single, streamlined stack that can be deployed in Kubernetes clusters. By automating the deployment process, Cluster-Forge simplifies the creation of consistent, ready-to-use clusters.
**Cluster-Forge** bundles third-party, community, and in-house components into a single, GitOps-managed stack deployable in Kubernetes clusters. It automates the deployment of a complete AI/ML compute platform with all essential services pre-configured and integrated.

This tool is ideal for scenarios such as:
Using a bootstrap-first deployment model, Cluster-Forge establishes GitOps infrastructure (ArgoCD, Gitea, OpenBao) before deploying the complete application stack via ArgoCD's app-of-apps pattern.

- **Ephemeral test clusters** - Create temporary environments quickly
- **CI/CD pipeline clusters** - Ensure consistent testing environments
- **Multiple production clusters** - Manage a fleet of clusters efficiently
- **Reproducible environments** - Ensure consistency across deployments
**Ideal for:**

- **AI/ML Engineers** - Unified platform for model training, serving, and orchestration
- **Platform Engineers** - Infrastructure automation with GitOps patterns
- **DevOps Teams** - Consistent deployment across development, staging, and production
- **Research Teams** - Ephemeral test clusters for experimentation

## 🚀 Quick Start

### Basic Deployment
### Single-Command Deployment
```bash
./scripts/bootstrap.sh <domain>
./scripts/bootstrap.sh <domain> [--cluster-size=small|medium|large]
```

### Size-Aware Deployment
### Size-Aware Deployment Examples
```bash
# Small cluster (1-5 users, development/testing)
./scripts/bootstrap.sh dev.example.com --CLUSTER_SIZE=small
./scripts/bootstrap.sh dev.example.com --cluster-size=small

# Medium cluster (5-20 users, team production) [DEFAULT]
./scripts/bootstrap.sh team.example.com --CLUSTER_SIZE=medium
./scripts/bootstrap.sh team.example.com --cluster-size=medium

# Large cluster (10s-100s users, enterprise scale)
./scripts/bootstrap.sh prod.example.com --CLUSTER_SIZE=large
./scripts/bootstrap.sh prod.example.com --cluster-size=large

# Deploy only specific components
./scripts/bootstrap.sh dev.example.com --apps=argocd,gitea,cluster-forge

# Deploy from specific branch/tag
./scripts/bootstrap.sh prod.example.com --target-revision=v1.8.0
```

For detailed deployment instructions, see the [Bootstrap Guide](docs/bootstrap_guide.md).
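
The `--cluster-size` flag shown above takes one of three values and falls back to `medium` when omitted. The sketch below illustrates how such a flag could be parsed and validated. It is an assumption about the general approach, not the actual code in `scripts/bootstrap.sh`, and the function name `parse_cluster_size` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical parser, not taken from bootstrap.sh: scan the arguments for
# --cluster-size=<value>, default to medium, and reject unknown sizes.
parse_cluster_size() {
  local size="medium"   # the README marks medium as the default
  local arg
  for arg in "$@"; do
    case "$arg" in
      --cluster-size=*) size="${arg#--cluster-size=}" ;;
    esac
  done
  case "$size" in
    small|medium|large) printf '%s\n' "$size" ;;
    *) echo "invalid cluster size: $size" >&2; return 1 ;;
  esac
}

# Example: the domain argument is ignored by this helper.
parse_cluster_size dev.example.com --cluster-size=small
```

Validating the value up front means an unsupported size fails before any manifests are rendered or applied.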

## 📋 Workflow
## 📋 Architecture

### Bootstrap-First Deployment

Cluster-Forge uses a three-phase bootstrap process:

Cluster-Forge deploys all necessary components within the cluster using GitOps-controller [ArgoCD](https://argo-cd.readthedocs.io/)
and [app-of-apps pattern](https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/#app-of-apps-pattern) where Cluster-Forge acts as an app of apps.
**Phase 1: Pre-Cleanup**
- Detects and removes previous installations when applicable
- Ensures clean state for fresh deployments

### GitOps Architecture
**Phase 2: GitOps Foundation Bootstrap** (Manual Helm Templates)
1. **ArgoCD** (v8.3.5) - GitOps controller deployed via helm template
2. **Gitea** (v12.3.0) - Git server with initialization job

Cluster-Forge supports two deployment modes:
- **External Mode**: Traditional GitOps with GitHub dependency
- **Local Mode**: Self-contained GitOps with local Gitea
**Phase 3: App-of-Apps Deployment** (ArgoCD-Managed)
- Creates cluster-forge Application pointing to root/ helm chart
- ArgoCD syncs all remaining applications including OpenBao from enabledApps list
- Applications deployed in wave order (-70 to 0) based on dependencies
- OpenBao (v0.18.2) managed via ArgoCD with openbao-init job
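
Wave ordering means that applications with lower sync-wave numbers are synced first. The sketch below sorts a list of `name wave` pairs the same way. The wave values assigned to each application are made up for illustration and are not taken from the chart, and `order_by_wave` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "name wave" pairs on stdin and print the names
# in ascending wave order (lowest wave first).
order_by_wave() {
  sort -k2,2n | awk '{print $1}'
}

# Illustrative wave numbers only; the real values live in the chart.
printf '%s\n' "airm 0" "openbao -70" "keycloak -30" | order_by_wave
```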

See [Values Inheritance Pattern](docs/values_inheritance_pattern.md) for detailed architecture documentation.
### Dual Repository GitOps Pattern

**Local Mode (Default)** - Self-contained cluster-native GitOps:
- Uses local Gitea for both cluster-forge and cluster-values repositories
- Zero external dependencies once bootstrapped
- Initialization handled by gitea-init-job

**External Mode** - Traditional GitHub-based GitOps:
- Points to external GitHub repository
- Supports custom branch selection for testing

See [Values Inheritance Pattern](docs/values_inheritance_pattern.md) for detailed architecture.

## 🛠️ Components

### Layer 1: GitOps Foundation
- **ArgoCD** - GitOps controller for continuous deployment
- **Gitea** - Git repository server for source management
- **OpenBao** - Vault-compatible secret management system
- **ArgoCD 8.3.5** - GitOps continuous deployment controller
- **Gitea 12.3.0** - Self-hosted Git server with SQLite backend
- **OpenBao 0.18.2** - Vault-compatible secrets management
- **External Secrets 0.15.1** - Secrets synchronization operator

### Layer 2: Core Infrastructure

**Networking & Security:**
- **Gateway API + KGateway** - Modern ingress and traffic management
- **Cert-Manager** - Automated TLS certificate management
- **MetalLB** - Load balancer for bare metal environments
- **External Secrets Operator** - External secret integration
- **Cilium** - Network security and observability
- **Kyverno** - Policy engine with modular policy system
- **Gateway API v1.3.0** - Kubernetes standard ingress API
- **KGateway v2.1.0-main** - Gateway API implementation with WebSocket support
- **MetalLB v0.15.2** - Bare metal load balancer
- **Cert-Manager v1.18.2** - Automated TLS certificate management
- **Kyverno 3.5.1** - Policy engine with modular policy system

**Storage & Database:**
- **Longhorn** - Distributed block storage
- **CNPG Operator** - Cloud-native PostgreSQL management
- **MinIO Operator + Tenant** - S3-compatible object storage

### Layer 3: Observability & Monitoring
- **Prometheus** - Metrics collection and alerting
- **Grafana** - Visualization and dashboarding
- **OpenTelemetry Operator** - Distributed tracing and telemetry
- **OTEL-LGTM Stack** - Unified observability platform (Loki, Grafana, Tempo, Mimir)

### Layer 4: AI/ML Compute Stack
**GPU & Compute:**
- **AMD GPU Operator** - GPU device management and drivers
- **KubeRay Operator** - Ray distributed computing framework
- **KServe** - Kubernetes-native model serving
- **Kueue** - Advanced job queueing system
- **AppWrapper** - Application scheduling and resource management
- **KEDA** - Event-driven autoscaling

**Workflow & Orchestration:**
- **Kaiwo** - Workflow management system
- **RabbitMQ** - Message broker for async processing

### Layer 5: Identity & Access
- **Keycloak** - Enterprise identity and access management
- **Cluster-Auth** - Kubernetes RBAC integration

### Layer 6: AIRM App
- **AIRM API** - Central API layer for AMD Resource Manager
- **AIRM UI** - Frontend interface for resource management
- **AIRM Dispatcher** - Compute workload dispatching agent

## 💾 Storage Classes

Storage classes are provided by default with Longhorn. These can be customized as needed.

| Purpose | StorageClass | Access Mode | Locality |
|---------|--------------|-------------|----------|
| GPU Job | mlstorage | RWO | LOCAL/remote |
| GPU Job | default | RWO | LOCAL/remote |
| Advanced usage | direct | RWO | LOCAL |
| Multi-container | multinode | RWX | ANYWHERE |

## 📄 Configuration
- **CNPG Operator 0.26.0** - CloudNativePG PostgreSQL operator
- **MinIO Operator 7.1.1** - S3-compatible object storage operator
- **MinIO Tenant 7.1.1** - Tenant deployment with default-bucket and models buckets

### Cluster Sizing
### Layer 3: Observability
- **Prometheus Operator CRDs 23.0.0** - Metrics infrastructure
- **OpenTelemetry Operator 0.93.1** - Telemetry collection
- **OTEL-LGTM Stack v1.0.7** - Integrated observability (Loki, Grafana, Tempo, Mimir)

Cluster-Forge provides three pre-configured cluster profiles:
### Layer 4: Identity & Access
- **Keycloak** (keycloak-old chart) - Enterprise IAM with AIRM realm
- **Cluster-Auth 0.5.0** - Kubernetes RBAC integration

- **Small**: Minimal resources, local-path storage, RWX→RWO access mode conversion
- **Medium**: Balanced resources, local-path storage, RWX→RWO access mode conversion
- **Large**: Full enterprise features, Longhorn storage, native RWX support
### Layer 5: AI/ML Compute Stack

**GPU & Scheduling:**
- **AMD GPU Operator v1.4.1** - GPU device plugin and drivers
- **KubeRay Operator 1.4.2** - Ray distributed computing framework
- **Kueue 0.13.0** - Job queueing with multi-framework support
- **AppWrapper v1.1.2** - Application-level resource scheduling
- **KEDA 2.18.1** - Event-driven autoscaling

**ML Serving & Inference:**
- **KServe v0.16.0** - Model serving platform (Standard deployment mode)

**Workflow & Messaging:**
- **Kaiwo v0.2.0-rc11** - AI workload orchestration
- **RabbitMQ v2.15.0** - Message broker for async processing

### Layer 6: AIRM Application
- **AIRM 0.3.2** - AMD Resource Manager application suite
- **AIM Cluster Model Source** - Cluster resource models for AIRM
- **Configurable Image Repositories** - Supports custom container registries via cluster-bloom `AIRM_IMAGE_REPOSITORY` parameter

## 📄 Configuration

### Cluster Sizing

Three cluster profiles with inheritance-based resource optimization:

**Small Clusters** (1-5 users, dev/test):
- Single replica deployments
- Reduced resource limits (ArgoCD controller: 2 CPU, 4Gi RAM)
- Adds kyverno-policies-storage-local-path for RWX→RWO PVC mutation
- MinIO tenant: 250Gi storage
- Suitable for: Local workstations, development environments

**Medium Clusters** (5-20 users, team production):
- Single replica with moderate resource allocation
- Same storage policies as small (local-path support)
- ArgoCD controller: 2 CPU, 4Gi RAM
- Default configuration for balanced performance
- Suitable for: Small teams, staging environments

**Large Clusters** (10s-100s users, enterprise scale):
- OpenBao HA: 3 replicas with Raft consensus
- No local-path policies (assumes distributed storage)
- MinIO tenant: 500Gi storage
- Production-grade resource allocation
- Suitable for: Production deployments, multi-tenant environments

See [Cluster Size Configuration](docs/cluster_size_configuration.md) for detailed specifications.

### Values Files

Configuration follows a streamlined inheritance pattern:
- **Base**: 52 common applications with alpha-sorted enabledApps
- **Base**: Common applications with alpha-sorted enabledApps
- **Size-specific**: Only override differences from base (DRY principle)
- **Runtime**: Domain and cluster-specific parameters
- **Runtime**: Domain and cluster-specific parameters injected during bootstrap

The bootstrap script uses YAML merge semantics where size-specific values override base values.yaml settings.
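
One common way to get this override behavior is to pass several values files to Helm, where the right-most `-f` file takes precedence. The sketch below builds such an argument list. It assumes size files live next to `root/values.yaml` and are named `values_<size>.yaml`, and the helper `values_file_args` is hypothetical. Whether `bootstrap.sh` layers the files exactly this way is not confirmed by this diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print Helm -f arguments with the base file first and
# the size-specific file last, so size-specific keys override base keys.
# The root/values_<size>.yaml path is an assumption for illustration.
values_file_args() {
  local size="${1:-medium}"
  printf -- '-f root/values.yaml -f root/values_%s.yaml\n' "$size"
}

# Example: arguments that could be appended to a `helm template root/` call.
values_file_args small
```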

## 📚 Documentation

@@ -135,11 +174,13 @@ Comprehensive documentation is available in the `/docs` folder:
| **Getting Started** | [Bootstrap Guide](docs/bootstrap_guide.md) |
| **Configuration** | [Cluster Size Configuration](docs/cluster_size_configuration.md) |
| **Architecture** | [Values Inheritance Pattern](docs/values_inheritance_pattern.md) |
| **Security** | [Kyverno Modular Design](docs/kyverno_modular_design.md) |
| **Policies** | [Kyverno Access Mode Policy](docs/kyverno_access_mode_policy.md) |
| **Secrets** | [Secrets Management Architecture](docs/secrets_management_architecture.md) |
| **Policy System** | [Kyverno Modular Design](docs/kyverno_modular_design.md) |
| **Storage Policies** | [Kyverno Access Mode Policy](docs/kyverno_access_mode_policy.md) |
| **Operations** | [Backup and Restore](docs/backup_and_restore.md) |

Additional documentation:
- **SBOM**: See `/sbom` folder for software bill of materials generation and validation

## 📝 License

Cluster-Forge is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for details.