diff --git a/charts/maestrod/CHANGELOG.md b/charts/maestrod/CHANGELOG.md index da9e314..12fcbc2 100644 --- a/charts/maestrod/CHANGELOG.md +++ b/charts/maestrod/CHANGELOG.md @@ -1,14 +1,27 @@ # Changelog - [Changelog](#changelog) + - [0.7.0 (2026-06-17)](#070-2026-06-17) + - [Added](#added) - [0.6.2 (2026-06-12)](#062-2026-06-12) - [Changed](#changed) - [0.6.1 (2026-05-30)](#061-2026-05-30) - [Changed](#changed-1) - [0.6.0 (2026-05-29)](#060-2026-05-29) - - [Added](#added) - - [0.5.0 (2026-05-27)](#050-2026-05-27) - [Added](#added-1) + - [0.5.0 (2026-05-27)](#050-2026-05-27) + - [Added](#added-2) + +## 0.7.0 (2026-06-17) + +### Added + +- Grafana dashboard for Maestrod, delivered as a sidecar-discovered ConfigMap + (`observability.metrics.grafanaDashboard`, disabled by default). Panels cover + per-route AI token usage (`nutrient.ai.tokens_total`), AI call latency and + reliability, per-route HTTP RED metrics, vision quality/throughput, and process + health. Requires the `serviceMonitor` (or another scrape path) and a Grafana + sidecar watching the `grafana_dashboard` label. ## 0.6.2 (2026-06-12) diff --git a/charts/maestrod/Chart.yaml b/charts/maestrod/Chart.yaml index 693e5de..925b245 100644 --- a/charts/maestrod/Chart.yaml +++ b/charts/maestrod/Chart.yaml @@ -4,7 +4,7 @@ type: application description: Maestrod, the orchestration backend for Nutrient managed cloud workloads. home: https://www.nutrient.io icon: https://cdn.prod.website-files.com/65fdb7696055f07a05048833/66e58e33c3880ff24aa34027_nutrient-logo.png -version: 0.6.2 +version: 0.7.0 appVersion: "1.1.3" keywords: diff --git a/charts/maestrod/README.md b/charts/maestrod/README.md index b9f797b..7fe944c 100644 --- a/charts/maestrod/README.md +++ b/charts/maestrod/README.md @@ -2,7 +2,7 @@ > [!WARNING] This chart is made for internal use by Nutrient. 
-![Version: 0.6.2](https://img.shields.io/badge/Version-0.6.2-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 1.1.3](https://img.shields.io/badge/AppVersion-1.1.3-informational?style=flat-square) +![Version: 0.7.0](https://img.shields.io/badge/Version-0.7.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 1.1.3](https://img.shields.io/badge/AppVersion-1.1.3-informational?style=flat-square) Maestrod, the orchestration backend for Nutrient managed cloud workloads. @@ -221,6 +221,12 @@ namespace. |-----|-------------|---------| | [`observability`](./values.yaml#L209) | Observability settings for Maestrod. | [...](./values.yaml#L209) | | [`observability.metrics`](./values.yaml#L213) | Metrics integration settings. | [...](./values.yaml#L213) | +| [`observability.metrics.grafanaDashboard`](./values.yaml#L251) | Grafana dashboard delivered as a sidecar-discovered ConfigMap. Requires the `serviceMonitor` (or another scrape path) so Prometheus has Maestrod's `/metrics`, and a Grafana sidecar watching ConfigMaps with the `grafana_dashboard` label. | [...](./values.yaml#L251) | +| [`observability.metrics.grafanaDashboard.configMap`](./values.yaml#L258) | ConfigMap parameters. | [...](./values.yaml#L258) | +| [`observability.metrics.grafanaDashboard.configMap.labels`](./values.yaml#L262) | ConfigMap labels. The Grafana sidecar discovers dashboards by the `grafana_dashboard` label; keep it set unless your sidecar uses a different selector. | `{"grafana_dashboard":"1"}` | +| [`observability.metrics.grafanaDashboard.enabled`](./values.yaml#L254) | Create the Grafana dashboard ConfigMap for Maestrod. | `false` | +| [`observability.metrics.grafanaDashboard.tags`](./values.yaml#L271) | Dashboard tags. 
| `["Nutrient","maestrod"]` | +| [`observability.metrics.grafanaDashboard.title`](./values.yaml#L268) | Dashboard title. | `*generated*` | | [`observability.metrics.serviceMonitor`](./values.yaml#L218) | Prometheus [ServiceMonitor](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#monitoring.coreos.com/v1.ServiceMonitor) scraping Maestrod's `/metrics` endpoint on the existing `http` Service port. | [...](./values.yaml#L218) | | [`observability.metrics.serviceMonitor.enabled`](./values.yaml#L221) | Create a Prometheus Operator ServiceMonitor for Maestrod. | `false` | | [`observability.metrics.serviceMonitor.honorLabels`](./values.yaml#L242) | Honor labels from scraped metrics. | `false` | @@ -236,53 +242,53 @@ namespace. | Key | Description | Default | |-----|-------------|---------| -| [`lifecycle`](./values.yaml#L296) | [Container lifecycle hooks](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/). | `{}` | -| [`livenessProbe`](./values.yaml#L268) | [Liveness probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) against Maestrod's `/health` HTTP endpoint. Polls less often than readiness and is more forgiving — a failure restarts the container, so this should only trip on true deadlock. Set `livenessProbe: {}` to disable. | [...](./values.yaml#L268) | -| [`readinessProbe`](./values.yaml#L282) | [Readiness probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) against Maestrod's `/health` HTTP endpoint. Set `readinessProbe: {}` to disable. | [...](./values.yaml#L282) | -| [`startupProbe`](./values.yaml#L253) | [Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) against Maestrod's `/health` HTTP endpoint. Generous `failureThreshold` so a slow initial boot doesn't get killed (10 s × 30 = 5 min budget). 
Set `startupProbe: {}` to disable. | [...](./values.yaml#L253) | -| [`terminationGracePeriodSeconds`](./values.yaml#L293) | [Termination grace period](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/). | `30` | +| [`lifecycle`](./values.yaml#L324) | [Container lifecycle hooks](https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/). | `{}` | +| [`livenessProbe`](./values.yaml#L296) | [Liveness probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) against Maestrod's `/health` HTTP endpoint. Polls less often than readiness and is more forgiving — a failure restarts the container, so this should only trip on true deadlock. Set `livenessProbe: {}` to disable. | [...](./values.yaml#L296) | +| [`readinessProbe`](./values.yaml#L310) | [Readiness probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) against Maestrod's `/health` HTTP endpoint. Set `readinessProbe: {}` to disable. | [...](./values.yaml#L310) | +| [`startupProbe`](./values.yaml#L281) | [Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) against Maestrod's `/health` HTTP endpoint. Generous `failureThreshold` so a slow initial boot doesn't get killed (10 s × 30 = 5 min budget). Set `startupProbe: {}` to disable. | [...](./values.yaml#L281) | +| [`terminationGracePeriodSeconds`](./values.yaml#L321) | [Termination grace period](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/). | `30` | ### Scheduling | Key | Description | Default | |-----|-------------|---------| -| [`affinity`](./values.yaml#L372) | Node affinity. | `{}` | -| [`autoscaling`](./values.yaml#L303) | [HorizontalPodAutoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/). When `enabled: true`, the chart's HPA controls the replica count and `replicaCount` is ignored. 
| [...](./values.yaml#L303) | -| [`autoscaling.behavior`](./values.yaml#L321) | HPA [scaling behaviour](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior). | `{}` | -| [`autoscaling.enabled`](./values.yaml#L306) | Enable the HPA. | `false` | -| [`autoscaling.maxReplicas`](./values.yaml#L312) | Maximum replicas. | `10` | -| [`autoscaling.minReplicas`](./values.yaml#L309) | Minimum replicas. | `1` | -| [`autoscaling.targetCPUUtilizationPercentage`](./values.yaml#L315) | Target average CPU utilisation (percentage). `null` disables the metric. | `nil` | -| [`autoscaling.targetMemoryUtilizationPercentage`](./values.yaml#L318) | Target average memory utilisation (percentage). `null` disables the metric. | `nil` | -| [`nodeSelector`](./values.yaml#L369) | [Node selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/). | `{}` | -| [`podDisruptionBudget`](./values.yaml#L356) | [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/). When both `minAvailable` and `maxUnavailable` are non-empty, `maxUnavailable` wins (the two fields are mutually exclusive in Kubernetes). Either field accepts an integer (e.g. `1`) or a percentage string (e.g. `"50%"`). | [...](./values.yaml#L356) | -| [`podDisruptionBudget.create`](./values.yaml#L359) | Create a PodDisruptionBudget for Maestrod. | `false` | -| [`podDisruptionBudget.maxUnavailable`](./values.yaml#L365) | `spec.maxUnavailable`. Integer or percentage string. Takes precedence over `minAvailable`. | `""` | -| [`podDisruptionBudget.minAvailable`](./values.yaml#L362) | `spec.minAvailable`. Integer or percentage string. Ignored when `maxUnavailable` is set. | `1` | -| [`priorityClassName`](./values.yaml#L381) | [PriorityClass](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) name. | `""` | -| [`replicaCount`](./values.yaml#L335) | Number of replicas. 
Ignored when `autoscaling.enabled` is `true`. | `3` | -| [`resources`](./values.yaml#L325) | [Resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/). | `{"limits":{"cpu":"4","memory":"8Gi"},"requests":{"cpu":"4","memory":"8Gi"}}` | -| [`revisionHistoryLimit`](./values.yaml#L348) | [Revision history limit](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy). | `1` | -| [`schedulerName`](./values.yaml#L384) | [Scheduler](https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/) name. | `""` | -| [`tolerations`](./values.yaml#L375) | [Node tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/). | `[]` | -| [`topologySpreadConstraints`](./values.yaml#L378) | [Topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/). | `[]` | -| [`updateStrategy`](./values.yaml#L341) | [Update strategy](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy). `rollingUpdate.maxSurge` and `rollingUpdate.maxUnavailable` are `IntOrString` in Kubernetes — both an integer (e.g. `1`) and a percentage string (e.g. `"25%"`) are accepted. | `{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"}` | +| [`affinity`](./values.yaml#L400) | Node affinity. | `{}` | +| [`autoscaling`](./values.yaml#L331) | [HorizontalPodAutoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/). When `enabled: true`, the chart's HPA controls the replica count and `replicaCount` is ignored. | [...](./values.yaml#L331) | +| [`autoscaling.behavior`](./values.yaml#L349) | HPA [scaling behaviour](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#configurable-scaling-behavior). | `{}` | +| [`autoscaling.enabled`](./values.yaml#L334) | Enable the HPA. | `false` | +| [`autoscaling.maxReplicas`](./values.yaml#L340) | Maximum replicas. 
| `10` | +| [`autoscaling.minReplicas`](./values.yaml#L337) | Minimum replicas. | `1` | +| [`autoscaling.targetCPUUtilizationPercentage`](./values.yaml#L343) | Target average CPU utilisation (percentage). `null` disables the metric. | `nil` | +| [`autoscaling.targetMemoryUtilizationPercentage`](./values.yaml#L346) | Target average memory utilisation (percentage). `null` disables the metric. | `nil` | +| [`nodeSelector`](./values.yaml#L397) | [Node selector](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/). | `{}` | +| [`podDisruptionBudget`](./values.yaml#L384) | [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb/). When both `minAvailable` and `maxUnavailable` are non-empty, `maxUnavailable` wins (the two fields are mutually exclusive in Kubernetes). Either field accepts an integer (e.g. `1`) or a percentage string (e.g. `"50%"`). | [...](./values.yaml#L384) | +| [`podDisruptionBudget.create`](./values.yaml#L387) | Create a PodDisruptionBudget for Maestrod. | `false` | +| [`podDisruptionBudget.maxUnavailable`](./values.yaml#L393) | `spec.maxUnavailable`. Integer or percentage string. Takes precedence over `minAvailable`. | `""` | +| [`podDisruptionBudget.minAvailable`](./values.yaml#L390) | `spec.minAvailable`. Integer or percentage string. Ignored when `maxUnavailable` is set. | `1` | +| [`priorityClassName`](./values.yaml#L409) | [PriorityClass](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) name. | `""` | +| [`replicaCount`](./values.yaml#L363) | Number of replicas. Ignored when `autoscaling.enabled` is `true`. | `3` | +| [`resources`](./values.yaml#L353) | [Resources](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/). 
| `{"limits":{"cpu":"4","memory":"8Gi"},"requests":{"cpu":"4","memory":"8Gi"}}` | +| [`revisionHistoryLimit`](./values.yaml#L376) | [Revision history limit](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy). | `1` | +| [`schedulerName`](./values.yaml#L412) | [Scheduler](https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/) name. | `""` | +| [`tolerations`](./values.yaml#L403) | [Node tolerations](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/). | `[]` | +| [`topologySpreadConstraints`](./values.yaml#L406) | [Topology spread constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/). | `[]` | +| [`updateStrategy`](./values.yaml#L369) | [Update strategy](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#strategy). `rollingUpdate.maxSurge` and `rollingUpdate.maxUnavailable` are `IntOrString` in Kubernetes — both an integer (e.g. `1`) and a percentage string (e.g. `"25%"`) are accepted. | `{"rollingUpdate":{"maxSurge":1,"maxUnavailable":0},"type":"RollingUpdate"}` | ### Restart job | Key | Description | Default | |-----|-------------|---------| -| [`restartJob`](./values.yaml#L391) | Optional CronJob that polls the configured image registry for a new digest on the running `image.tag` and patches the Maestrod Deployment with a refresh annotation to trigger a rollout. Disabled by default. | [...](./values.yaml#L391) | -| [`restartJob.affinity`](./values.yaml#L431) | Affinity for the restart-job pod. | `{}` | -| [`restartJob.enabled`](./values.yaml#L394) | Enable the restart-job CronJob and its supporting RBAC/ServiceAccount. | `false` | -| [`restartJob.image`](./values.yaml#L402) | Image for the restart-job container. Must contain `kubectl`, `curl`, `jq`, and `bash` — `alpine/k8s` covers all four. | [...](./values.yaml#L402) | -| [`restartJob.nodeSelector`](./values.yaml#L425) | Node selector for the restart-job pod. 
| `{}` | -| [`restartJob.podAnnotations`](./values.yaml#L413) | Pod annotations for the restart-job pod. | `{"skip-auto-labelling":"true"}` | -| [`restartJob.podLabels`](./values.yaml#L417) | Pod labels for the restart-job pod. | `{}` | -| [`restartJob.registryAuthSecretName`](./values.yaml#L410) | Name of a pre-existing `kubernetes.io/dockerconfigjson` Secret holding the registry credentials used to query the image manifest. Required when `restartJob.enabled: true`; rendering fails otherwise. | `""` | -| [`restartJob.schedule`](./values.yaml#L397) | CronJob schedule. | `"*/10 * * * *"` | -| [`restartJob.serviceAccount`](./values.yaml#L421) | ServiceAccount for the restart-job pod. | [...](./values.yaml#L421) | -| [`restartJob.tolerations`](./values.yaml#L428) | Tolerations for the restart-job pod. | `[]` | +| [`restartJob`](./values.yaml#L419) | Optional CronJob that polls the configured image registry for a new digest on the running `image.tag` and patches the Maestrod Deployment with a refresh annotation to trigger a rollout. Disabled by default. | [...](./values.yaml#L419) | +| [`restartJob.affinity`](./values.yaml#L459) | Affinity for the restart-job pod. | `{}` | +| [`restartJob.enabled`](./values.yaml#L422) | Enable the restart-job CronJob and its supporting RBAC/ServiceAccount. | `false` | +| [`restartJob.image`](./values.yaml#L430) | Image for the restart-job container. Must contain `kubectl`, `curl`, `jq`, and `bash` — `alpine/k8s` covers all four. | [...](./values.yaml#L430) | +| [`restartJob.nodeSelector`](./values.yaml#L453) | Node selector for the restart-job pod. | `{}` | +| [`restartJob.podAnnotations`](./values.yaml#L441) | Pod annotations for the restart-job pod. | `{"skip-auto-labelling":"true"}` | +| [`restartJob.podLabels`](./values.yaml#L445) | Pod labels for the restart-job pod. 
| `{}` | +| [`restartJob.registryAuthSecretName`](./values.yaml#L438) | Name of a pre-existing `kubernetes.io/dockerconfigjson` Secret holding the registry credentials used to query the image manifest. Required when `restartJob.enabled: true`; rendering fails otherwise. | `""` | +| [`restartJob.schedule`](./values.yaml#L425) | CronJob schedule. | `"*/10 * * * *"` | +| [`restartJob.serviceAccount`](./values.yaml#L449) | ServiceAccount for the restart-job pod. | [...](./values.yaml#L449) | +| [`restartJob.tolerations`](./values.yaml#L456) | Tolerations for the restart-job pod. | `[]` | ## Contribution diff --git a/charts/maestrod/dashboards/maestrod-single-namespace.json b/charts/maestrod/dashboards/maestrod-single-namespace.json new file mode 100644 index 0000000..edc127f --- /dev/null +++ b/charts/maestrod/dashboards/maestrod-single-namespace.json @@ -0,0 +1,957 @@ +{ + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 1, + "id": 0, + "links": [], + "panels": [ + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 100, + "panels": [], + "title": "AI token usage", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Tokens billed per Maestro action (route), split by input/output. 
Source: nutrient.ai.tokens_total, tagged with the action that drove the call.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "tokens/s", + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "stacking": { + "group": "A", + "mode": "normal" + } + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 1 + }, + "id": 101, + "options": { + "legend": { + "displayMode": "table", + "calcs": [ + "sum" + ], + "placement": "right" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (operation, gen_ai_token_type) (rate(nutrient_ai_tokens_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "{{operation}} · {{gen_ai_token_type}}", + "refId": "A" + } + ], + "title": "Token rate by route (input/output)", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Total tokens consumed per route and model over the selected time range.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "align": "auto" + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Tokens (range total)" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "type": "color-text" + } + }, + { + "id": "unit", + "value": "short" + } + ] + } + ] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 1 + }, + "id": 102, + "options": { + "showHeader": true + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (operation, gen_ai_request_model, gen_ai_token_type) (increase(nutrient_ai_tokens_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__range]))", + "format": 
"table", + "instant": true, + "legendFormat": "__auto", + "refId": "A" + } + ], + "title": "Tokens by route + model (range total)", + "type": "table", + "transformations": [ + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true + }, + "renameByName": { + "operation": "Route", + "gen_ai_request_model": "Model", + "gen_ai_token_type": "Token type", + "Value": "Tokens (range total)" + }, + "indexByName": { + "operation": 0, + "gen_ai_request_model": 1, + "gen_ai_token_type": 2, + "Value": 3 + } + } + } + ] + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "MEAI gen_ai.client.token.usage — token distribution per model (input + output), independent of route.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "tokens/s" + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 9 + }, + "id": 103, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (gen_ai_request_model, gen_ai_token_type) (rate(gen_ai_client_token_usage_sum{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "{{gen_ai_request_model}} · {{gen_ai_token_type}}", + "refId": "A" + } + ], + "title": "Token rate by model (MEAI)", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Per-call AI latency (gen_ai.client.operation.duration) p50/p95 by model.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 0, + "lineWidth": 1, + "axisLabel": "latency (s)" + }, + "unit": "s" + }, + "overrides": [] + }, + 
"gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 9 + }, + "id": 104, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "histogram_quantile(0.95, sum by (le, gen_ai_request_model) (rate(gen_ai_client_operation_duration_seconds_bucket{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])))", + "legendFormat": "p95 · {{gen_ai_request_model}}", + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "histogram_quantile(0.50, sum by (le, gen_ai_request_model) (rate(gen_ai_client_operation_duration_seconds_bucket{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])))", + "legendFormat": "p50 · {{gen_ai_request_model}}", + "refId": "B" + } + ], + "title": "AI call latency (p50/p95)", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "AI reliability: retry round-trips, empty-result fallbacks, and pipeline warnings.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "events/s" + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 24, + "x": 0, + "y": 17 + }, + "id": 105, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (gen_ai_request_model) (rate(nutrient_ai_attempts_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "attempts · {{gen_ai_request_model}}", + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum 
(rate(nutrient_ai_empty_result_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "empty results", + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum (rate(nutrient_ai_warnings_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "warnings", + "refId": "C" + } + ], + "title": "AI attempts / empty results / warnings", + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 25 + }, + "id": 200, + "panels": [], + "title": "Per-route HTTP (RED)", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Request rate per route (/run/* and /compose).", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "requests/s" + }, + "unit": "reqps" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 8, + "x": 0, + "y": 26 + }, + "id": 201, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (http_route) (rate(http_server_request_duration_seconds_count{namespace=\"$namespace\", pod=~\"$pod\", http_route=~\"/run/.*|/compose\"}[$__rate_interval]))", + "legendFormat": "{{http_route}}", + "refId": "A" + } + ], + "title": "Request rate by route", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Error ratio (5xx) per route.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "error ratio" + }, + "unit": "percentunit", + "min": 0 + }, + "overrides": [] + }, +
"gridPos": { + "h": 8, + "w": 8, + "x": 8, + "y": 26 + }, + "id": 202, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (http_route) (rate(http_server_request_duration_seconds_count{namespace=\"$namespace\", pod=~\"$pod\", http_route=~\"/run/.*|/compose\", http_response_status_code=~\"5..\"}[$__rate_interval])) / sum by (http_route) (rate(http_server_request_duration_seconds_count{namespace=\"$namespace\", pod=~\"$pod\", http_route=~\"/run/.*|/compose\"}[$__rate_interval]))", + "legendFormat": "{{http_route}}", + "refId": "A" + } + ], + "title": "Error ratio by route", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "p95 request duration per route.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 0, + "lineWidth": 1, + "axisLabel": "latency (s)" + }, + "unit": "s" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 8, + "x": 16, + "y": 26 + }, + "id": 203, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "histogram_quantile(0.95, sum by (le, http_route) (rate(http_server_request_duration_seconds_bucket{namespace=\"$namespace\", pod=~\"$pod\", http_route=~\"/run/.*|/compose\"}[$__rate_interval])))", + "legendFormat": "{{http_route}}", + "refId": "A" + } + ], + "title": "p95 latency by route", + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 34 + }, + "id": 300, + "panels": [], + "title": "Vision quality & throughput", + "type": "row" + }, + { + "datasource": { + 
"type": "prometheus", + "uid": "prometheus" + }, + "description": "Documents processed and pages per document (avg).", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "docs/s · avg pages/doc" + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 35 + }, + "id": 301, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum(rate(nutrient_vision_pages_per_document_count{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "documents/s", + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum(rate(nutrient_vision_pages_per_document_sum{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])) / sum(rate(nutrient_vision_pages_per_document_count{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "avg pages/doc", + "refId": "B" + } + ], + "title": "Document throughput", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Average words and layout zones detected per page, by capacity.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 0, + "lineWidth": 1, + "axisLabel": "count / page" + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 35 + }, + "id": 302, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by 
(capacity) (rate(nutrient_vision_words_per_page_sum{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])) / sum by (capacity) (rate(nutrient_vision_words_per_page_count{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "words/page · {{capacity}}", + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (capacity) (rate(nutrient_vision_zones_per_page_sum{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])) / sum by (capacity) (rate(nutrient_vision_zones_per_page_count{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "zones/page · {{capacity}}", + "refId": "B" + } + ], + "title": "Words & zones per page", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Classification confidence (avg) — surfaces model-quality regressions.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 0, + "lineWidth": 1, + "axisLabel": "confidence (0–1)" + }, + "unit": "percentunit", + "min": 0, + "max": 1 + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 43 + }, + "id": 303, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum(rate(nutrient_vision_classification_confidence_sum{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])) / sum(rate(nutrient_vision_classification_confidence_count{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "avg confidence", + "refId": "A" + } + ], + "title": "Classification confidence", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Artifact export 
end-to-end duration p95 by format.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 0, + "lineWidth": 1, + "axisLabel": "p95 duration (ms)" + }, + "unit": "ms" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 43 + }, + "id": 304, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "histogram_quantile(0.95, sum by (le, format) (rate(nutrient_vision_artifact_export_duration_ms_milliseconds_bucket{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval])))", + "legendFormat": "p95 · {{format}}", + "refId": "A" + } + ], + "title": "Artifact export duration (p95)", + "type": "timeseries" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 51 + }, + "id": 400, + "panels": [], + "title": "Process health", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Working-set memory of the daemon process.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "working-set bytes" + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 0, + "y": 52 + }, + "id": 401, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (pod) (dotnet_process_memory_working_set_bytes{namespace=\"$namespace\", pod=~\"$pod\"})", + "legendFormat": "{{pod}}", + "refId": "A" + } + ], + "title": "Working-set memory", + "type": "timeseries" + }, + { + "datasource": 
{ + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Process CPU usage (cores).", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "drawStyle": "line", + "fillOpacity": 10, + "lineWidth": 1, + "axisLabel": "cores" + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 12, + "x": 12, + "y": 52 + }, + "id": 402, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom" + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "expr": "sum by (pod) (rate(dotnet_process_cpu_time_seconds_total{namespace=\"$namespace\", pod=~\"$pod\"}[$__rate_interval]))", + "legendFormat": "{{pod}}", + "refId": "A" + } + ], + "title": "CPU usage (cores)", + "type": "timeseries" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": [ + "<<<>>>" + ], + "templating": { + "list": [ + { + "allowCustomValue": false, + "current": { + "text": "<<<>>>", + "value": "<<<>>>" + }, + "label": "Namespace", + "name": "namespace", + "options": [ + { + "selected": true, + "text": "<<<>>>", + "value": "<<<>>>" + } + ], + "query": "<<<>>>", + "type": "custom" + }, + { + "current": { + "text": "All", + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "definition": "label_values(http_server_request_duration_seconds_count{namespace=~\"$namespace\"},pod)", + "includeAll": true, + "label": "Pod", + "multi": true, + "name": "pod", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(http_server_request_duration_seconds_count{namespace=~\"$namespace\"},pod)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 2, + "regex": "", + "type": "query" + } + ] + }, + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "<<<>>>", + "uid": "<<<>>>", + "version": 
1 +} diff --git a/charts/maestrod/templates/monitoring/grafana-dashboard.ConfigMap.yaml b/charts/maestrod/templates/monitoring/grafana-dashboard.ConfigMap.yaml new file mode 100644 index 0000000..8333a38 --- /dev/null +++ b/charts/maestrod/templates/monitoring/grafana-dashboard.ConfigMap.yaml @@ -0,0 +1,23 @@ +{{- if .Values.observability.metrics.grafanaDashboard.enabled }} +apiVersion: v1 +kind: ConfigMap +metadata: + name: {{ include "maestrod.fullname" . }}-dashboard + namespace: {{ .Release.Namespace | quote }} + labels: + {{- include "maestrod.labels" . | nindent 4 }} + {{- with .Values.observability.metrics.grafanaDashboard.configMap.labels }} + {{- toYaml . | nindent 4 }} + {{- end }} +data: + maestrod-{{ printf "%s-%s" .Release.Namespace .Release.Name }}.json: |- +{{ .Files.Get "dashboards/maestrod-single-namespace.json" + | replace "<<<>>>" (tpl .Values.observability.metrics.grafanaDashboard.title $) + | replace "<<<>>>" (join "\", \"" .Values.observability.metrics.grafanaDashboard.tags) + | replace "<<<>>>" (printf "%s-%s" .Release.Namespace .Release.Name | trunc 40 | trimSuffix "-") + | replace "<<<>>>" .Release.Namespace + | indent 4 }} + + # Bumps the ConfigMap checksum on chart upgrade so the Grafana sidecar reloads the dashboard. + HELM_CHART_VERSION: {{ include "maestrod.chart" . 
| quote }} +{{- end }} diff --git a/charts/maestrod/values.schema.json b/charts/maestrod/values.schema.json index 8776975..bbcc1c6 100644 --- a/charts/maestrod/values.schema.json +++ b/charts/maestrod/values.schema.json @@ -191,6 +191,28 @@ "metrics": { "type": "object", "properties": { + "grafanaDashboard": { + "type": "object", + "properties": { + "enabled": { + "type": "boolean" + }, + "configMap": { + "type": "object", + "properties": { + "labels": { + "type": "object" + } + } + }, + "title": { + "type": "string" + }, + "tags": { + "type": "array" + } + } + }, "serviceMonitor": { "type": "object", "properties": { diff --git a/charts/maestrod/values.yaml b/charts/maestrod/values.yaml index df2bc0b..05c4756 100644 --- a/charts/maestrod/values.yaml +++ b/charts/maestrod/values.yaml @@ -243,6 +243,34 @@ observability: # -- ServiceMonitor job label. # @section -- D. Observability jobLabel: "" + # -- (object) Grafana dashboard delivered as a sidecar-discovered ConfigMap. Requires the + # `serviceMonitor` (or another scrape path) so Prometheus has Maestrod's `/metrics`, and a + # Grafana sidecar watching ConfigMaps with the `grafana_dashboard` label. + # @section -- D. Observability + # @notationType -- reference + grafanaDashboard: + # -- Create the Grafana dashboard ConfigMap for Maestrod. + # @section -- D. Observability + enabled: false + # -- (object) ConfigMap parameters. + # @section -- D. Observability + # @notationType -- reference + configMap: + # -- (object) ConfigMap labels. The Grafana sidecar discovers dashboards by the + # `grafana_dashboard` label; keep it set unless your sidecar uses a different selector. + # @section -- D. Observability + labels: + grafana_dashboard: "1" + # -- (tpl/string) Dashboard title. + # @notationType -- tpl + # @section -- D. Observability + # @default -- *generated* + title: "Maestrod ({{ .Release.Namespace }}/{{ include \"maestrod.fullname\" . }})" + # -- Dashboard tags. + # @section -- D. 
Observability + tags: + - Nutrient + - maestrod # -- (object) [Startup probe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) # against Maestrod's `/health` HTTP endpoint. Generous `failureThreshold` so a