Prometheus metrics #9
Merged
9 commits
f7a8bca
feature: added metrics service-monitor for prometheus & victoriametrics
c6b5e4e
- add prometheus metrics
b2a784b
improve makefile
751e389
add prefix to metrics
0bedac3
fix build
621d589
fix alerts
b0a5470
add logs
c295fa6
updates
fb6e2e3
discovery metrics & http retries
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,61 @@ | ||
| # AGENTS.md | ||
|
|
||
| ## Purpose | ||
|
|
||
| This repository contains the YTsaurus task proxy: a small Go service that discovers job-local services in YTsaurus and publishes dynamic Envoy xDS config for stable public routing with optional access checks. | ||
|
|
||
| ## Repository Layout | ||
|
|
||
| - `server/main.go` - binary entrypoint. Parses flags, creates the YT client, runs periodic discovery, and serves xDS + `ext_authz` gRPC on port `9090`. | ||
| - `server/pkg/discovery.go` - discovers tasks from running operations. Supports SPYT direct submit, SPYT standalone clusters, and generic operations annotated with `task_proxy`. | ||
| - `server/pkg/auth.go` - Envoy external authorization backend. Resolves the task from the host or `x-yt-taskproxy-*` headers and checks YTsaurus operation read permissions. | ||
| - `server/pkg/xds.go` - builds Envoy snapshots: listeners, clusters, virtual hosts, header-based routing, optional TLS, and `ext_authz`. | ||
| - `server/pkg/updater.go` - applies the latest snapshot to the cache, refreshes auth lookup data, and writes the `services` table to YT. | ||
| - `chart/` - Helm chart for deploying the `envoy` + `server` pod. | ||
| - `examples/grpc-service/` - sample gRPC service intended to run inside YTsaurus jobs. | ||
|
|
||
| ## Common Commands | ||
|
|
||
| - `make test` - runs unit tests in `server/pkg`. | ||
| - `make build` - builds the Linux `amd64` server binary into `server/server`. | ||
| - `make image RELEASE_VERSION=<version>` - builds the Docker image. | ||
| - `make helm-chart RELEASE_VERSION=<version>` - packages the Helm chart. | ||
|
|
||
| If `make test` fails because the Go tool cannot write to its cache in a sandboxed environment, run tests with a writable cache directory, for example: | ||
|
|
||
| ```sh | ||
| cd server/pkg | ||
| GOCACHE=/tmp/go-build GOTMPDIR=/tmp go test ./... | ||
| ``` | ||
|
|
||
| ## Runtime Model | ||
|
|
||
| - Envoy listens on `8080`. | ||
| - The Go service serves xDS and `ext_authz` on `9090`. | ||
| - Envoy bootstrap config is static in `chart/templates/config.yaml` and points to the local xDS server. | ||
| - Dynamic routing is generated from discovered tasks. Each task may be addressed by: | ||
| - a hash-based subdomain | ||
| - an alias-based subdomain when the YT operation has an alias | ||
| - `x-yt-taskproxy-*` routing headers | ||
|
|
||
| ## Change Guidelines | ||
|
|
||
| - Keep changes in `server/pkg/xds.go` and `server/pkg/auth.go` aligned. If you add or rename routing headers or domain formats, update both routing and authorization lookup logic. | ||
| - Discovery changes should preserve all currently supported task sources unless the change explicitly removes a scenario. | ||
| - `Task.Validate()` constrains alias-based hostnames. If you expand hostname semantics, update validation and tests together. | ||
| - When changing chart templates, keep the relationship between: | ||
| - `chart/templates/config.yaml` | ||
| - `chart/templates/deployment.yaml` | ||
| - `server/pkg/const.go` | ||
| consistent for ports, TLS mount paths, and container wiring. | ||
|
|
||
| ## Testing Expectations | ||
|
|
||
| - Update unit tests in `server/pkg/*_test.go` for behavior changes. | ||
| - `auth_test.go` covers request-to-task resolution precedence. | ||
| - `discovery_test.go` covers parsing of `task_proxy` annotations. | ||
| - `xds_test.go` covers the generated Envoy snapshot shape. | ||
|
|
||
| ## Notes | ||
|
|
||
| - The repo may contain local, untracked artifacts during development. Do not delete unrelated files unless the task explicitly asks for cleanup. |
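The Runtime Model section of `AGENTS.md` says a task can be addressed by a subdomain or by `x-yt-taskproxy-*` routing headers. The Go sketch below illustrates that kind of lookup under stated assumptions: the header name `x-yt-taskproxy-id`, the base domain `proxy.example.com`, the header-before-host precedence, and the `resolveTask` helper are all hypothetical and not taken from this PR. The real logic lives in `server/pkg/auth.go` and may differ.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

const (
	// Hypothetical header name; the source only gives the x-yt-taskproxy-* prefix.
	taskHeader = "x-yt-taskproxy-id"
	// Hypothetical public base domain under which task subdomains live.
	baseDomain = "proxy.example.com"
)

// resolveTask returns a lookup key for the request and where it came from.
// headers is a plain map keyed by lower-case header name (an assumption
// about how the caller normalizes them).
// Assumed precedence: routing header first, then the host subdomain.
func resolveTask(host string, headers map[string]string) (key, source string, ok bool) {
	if v := headers[taskHeader]; v != "" {
		return v, "header", true
	}
	// Strip an optional ":port" suffix; hosts without a port are left as-is.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	host = strings.ToLower(host)
	suffix := "." + baseDomain
	if !strings.HasSuffix(host, suffix) {
		return "", "", false
	}
	// Accept exactly one label in front of the base domain (hash or alias).
	sub := strings.TrimSuffix(host, suffix)
	if sub == "" || strings.Contains(sub, ".") {
		return "", "", false
	}
	return sub, "host", true
}

func main() {
	fmt.Println(resolveTask("abc123.proxy.example.com:8080", nil))
	fmt.Println(resolveTask("ignored.example.org", map[string]string{taskHeader: "my-alias"}))
	fmt.Println(resolveTask("example.org", nil))
}
```

Whatever precedence the real code uses, the Change Guidelines above imply that routing in `xds.go` and lookup in `auth.go` have to agree on it.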
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This repository keeps its primary agent instructions in `AGENTS.md`. | ||
|
|
||
| If you are an automated coding agent working in this repo: | ||
|
|
||
| 1. Read `AGENTS.md` first. | ||
| 2. Treat `AGENTS.md` as the source of truth for repository-specific guidance. | ||
| 3. Use this file only as a wrapper/entrypoint for tools or agents that look specifically for `CLAUDE.md`. | ||
|
|
||
| For project context, commands, architecture notes, and change guidelines, see `./AGENTS.md`. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,19 +1,64 @@ | ||
| REPO=ghcr.io/ytsaurus | ||
| REPO := ghcr.io/ytsaurus | ||
|
|
||
| TARGET_OS=linux | ||
| TARGET_ARCH=amd64 | ||
| TARGET_OS := linux | ||
| TARGET_ARCH := amd64 | ||
|
|
||
| ifndef RELEASE_VERSION | ||
| RELEASE_VERSION = 0.0.0 | ||
| RELEASE_VERSION ?= 0.0.0 | ||
|
|
||
| ifdef DOCKER_CONTEXT | ||
| DOCKER_ARGS := --context $(DOCKER_CONTEXT) | ||
| else | ||
| DOCKER_ARGS := | ||
| endif | ||
|
|
||
| BUILD_PLATFORM := $(TARGET_OS)/$(TARGET_ARCH) | ||
| IMAGE_TAG := $(REPO)/task-proxy:$(RELEASE_VERSION) | ||
| CHART_PACKAGE := task-proxy-chart-$(RELEASE_VERSION).tgz | ||
|
|
||
| .PHONY: test | ||
| test: | ||
| cd server/pkg && go test | ||
|
|
||
| release: | ||
| cd server && GOOS=$(TARGET_OS) GOARCH=$(TARGET_ARCH) go build -o server . && cd .. | ||
| docker build --platform $(TARGET_OS)/$(TARGET_ARCH) . -t $(REPO)/task-proxy:$(RELEASE_VERSION) | ||
| docker push $(REPO)/task-proxy:$(RELEASE_VERSION) | ||
| helm package chart | ||
| helm push task-proxy-chart-$(RELEASE_VERSION).tgz oci://$(REPO) | ||
| @echo "🧪 Running tests..." | ||
| cd server/pkg && go test ./... -v | ||
|
|
||
| .PHONY: build | ||
| build: | ||
| @echo "⚙️ Building binary for $(BUILD_PLATFORM)..." | ||
| cd server && \ | ||
| GOOS=$(TARGET_OS) GOARCH=$(TARGET_ARCH) \ | ||
| go build -o server . | ||
| @echo "✅ Binary built: server/server" | ||
|
|
||
| .PHONY: image | ||
| image: build test | ||
| @echo "🐳 Building Docker image: $(IMAGE_TAG)..." | ||
| docker $(DOCKER_ARGS) build \ | ||
| --platform $(BUILD_PLATFORM) \ | ||
| -t $(IMAGE_TAG) \ | ||
| . | ||
| @echo "✅ Image built: $(IMAGE_TAG)" | ||
|
|
||
| .PHONY: helm-chart | ||
| helm-chart: image | ||
| @echo "📦 Packaging Helm chart version $(RELEASE_VERSION)..." | ||
| helm package chart \ | ||
| --version $(RELEASE_VERSION) \ | ||
| --app-version $(RELEASE_VERSION) \ | ||
| --destination . | ||
| @echo "✅ Chart packaged: $(CHART_PACKAGE)" | ||
|
|
||
| .PHONY: release | ||
| release: helm-chart | ||
| @echo "🚀 Performing release version $(RELEASE_VERSION)..." | ||
| @echo " → Pushing Docker image..." | ||
| docker $(DOCKER_ARGS) push $(IMAGE_TAG) | ||
| @echo " → Pushing Helm chart..." | ||
| helm push $(CHART_PACKAGE) oci://$(REPO) | ||
| @echo "✅ Release completed: $(RELEASE_VERSION)" | ||
|
|
||
| .PHONY: clean | ||
| clean: | ||
| @echo "🗑️ Cleaning up artifacts..." | ||
| rm -f server/server $(CHART_PACKAGE) | ||
| @echo "✅ Cleanup completed" | ||
|
|
||
| .DEFAULT_GOAL := helm-chart |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,80 @@ | ||
| {{- if and .Values.monitoring.enabled .Values.monitoring.alerts.enabled }} | ||
|
|
||
| {{- if eq .Values.monitoring.engine "prometheus" }} | ||
| apiVersion: monitoring.coreos.com/v1 | ||
| kind: PrometheusRule | ||
| {{- else if eq .Values.monitoring.engine "victoriametrics" }} | ||
| apiVersion: operator.victoriametrics.com/v1beta1 | ||
| kind: VMRule | ||
| {{- else }} | ||
| {{- fail (printf "Unexpected monitoring engine: %s" .Values.monitoring.engine) }} | ||
| {{- end }} | ||
| metadata: | ||
| name: task-proxy-alerts | ||
| namespace: {{ .Release.Namespace }} | ||
| {{- if eq .Values.monitoring.engine "prometheus" }} | ||
| labels: | ||
| release: prometheus-stack | ||
| {{- end }} | ||
| spec: | ||
| groups: | ||
| - name: task-proxy | ||
| rules: | ||
| - alert: TaskProxyAuthorizationInfrastructureFailures | ||
| expr: sum(increase(yt_task_proxy_auth_infra_errors_total[1m])) > {{ .Values.monitoring.alerts.authInfrastructureFailuresPerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy authorization infrastructure failures exceed threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.authInfrastructureFailuresPerMinuteThreshold }} authorization infrastructure failures over the last minute. | ||
| - alert: TaskProxyAuthorizationContextDeadlineExceeded | ||
| expr: sum(increase(yt_task_proxy_auth_infra_errors_total{kind="context_deadline_exceeded"}[1m])) > {{ .Values.monitoring.alerts.authContextDeadlineExceededPerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy authorization context deadline exceeded rate exceeds threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.authContextDeadlineExceededPerMinuteThreshold }} context deadline exceeded authorization failures over the last minute. | ||
| - alert: TaskProxyAuthorizationGrpcUnavailable | ||
| expr: sum(increase(yt_task_proxy_auth_infra_errors_total{kind="grpc_unavailable"}[1m])) > {{ .Values.monitoring.alerts.authGrpcUnavailablePerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy authorization gRPC unavailable rate exceeds threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.authGrpcUnavailablePerMinuteThreshold }} gRPC unavailable authorization failures over the last minute. | ||
| - alert: TaskProxyAuthorizationConnectionTimeout | ||
| expr: sum(increase(yt_task_proxy_auth_infra_errors_total{kind="connection_timeout"}[1m])) > {{ .Values.monitoring.alerts.authConnectionTimeoutPerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy authorization connection timeout rate exceeds threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.authConnectionTimeoutPerMinuteThreshold }} connection timeout authorization failures over the last minute. | ||
| - alert: TaskProxyDiscoveryInfrastructureFailures | ||
| expr: sum(increase(yt_task_proxy_discovery_infra_errors_total[1m])) > {{ .Values.monitoring.alerts.discoveryInfrastructureFailuresPerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy discovery infrastructure failures exceed threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryInfrastructureFailuresPerMinuteThreshold }} discovery infrastructure failures over the last minute. | ||
| - alert: TaskProxyDiscoveryContextDeadlineExceeded | ||
| expr: sum(increase(yt_task_proxy_discovery_infra_errors_total{kind="context_deadline_exceeded"}[1m])) > {{ .Values.monitoring.alerts.discoveryContextDeadlineExceededPerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy discovery context deadline exceeded rate exceeds threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryContextDeadlineExceededPerMinuteThreshold }} context deadline exceeded discovery failures over the last minute. | ||
| - alert: TaskProxyDiscoveryGrpcUnavailable | ||
| expr: sum(increase(yt_task_proxy_discovery_infra_errors_total{kind="grpc_unavailable"}[1m])) > {{ .Values.monitoring.alerts.discoveryGrpcUnavailablePerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy discovery gRPC unavailable rate exceeds threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryGrpcUnavailablePerMinuteThreshold }} gRPC unavailable discovery failures over the last minute. | ||
| - alert: TaskProxyDiscoveryConnectionTimeout | ||
| expr: sum(increase(yt_task_proxy_discovery_infra_errors_total{kind="connection_timeout"}[1m])) > {{ .Values.monitoring.alerts.discoveryConnectionTimeoutPerMinuteThreshold }} | ||
| labels: | ||
| severity: warning | ||
| annotations: | ||
| summary: Task proxy discovery connection timeout rate exceeds threshold | ||
| description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryConnectionTimeoutPerMinuteThreshold }} connection timeout discovery failures over the last minute. | ||
|
|
||
| {{- end }} |
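The alert rules above filter `yt_task_proxy_auth_infra_errors_total` and `yt_task_proxy_discovery_infra_errors_total` by a `kind` label. The Go sketch below shows one way a service could map errors to those label values and render a labelled counter in the Prometheus text exposition format. It uses only the standard library and rests on assumptions: `errUnavailable` stands in for a gRPC `Unavailable` status, the `other` fallback kind is invented for illustration, and `kindCounter` is a hypothetical helper rather than the PR's implementation, which presumably uses a Prometheus client library.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"sort"
	"strings"
	"sync"
)

// errUnavailable is a hypothetical stand-in for a gRPC Unavailable status.
var errUnavailable = errors.New("unavailable")

// timeoutErr is a minimal net.Error that reports a timeout.
type timeoutErr struct{}

func (timeoutErr) Error() string   { return "i/o timeout" }
func (timeoutErr) Timeout() bool   { return true }
func (timeoutErr) Temporary() bool { return true }

// classify maps an error to a `kind` label value.
// context.DeadlineExceeded also satisfies net.Error with Timeout() == true,
// so it has to be checked before the generic timeout case.
func classify(err error) string {
	var netErr net.Error
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "context_deadline_exceeded"
	case errors.Is(err, errUnavailable):
		return "grpc_unavailable"
	case errors.As(err, &netErr) && netErr.Timeout():
		return "connection_timeout"
	default:
		return "other" // assumed fallback, not named in the alert rules
	}
}

// kindCounter is a tiny labelled counter, safe for concurrent use.
type kindCounter struct {
	mu     sync.Mutex
	name   string
	counts map[string]uint64
}

func newKindCounter(name string) *kindCounter {
	return &kindCounter{name: name, counts: map[string]uint64{}}
}

func (c *kindCounter) Inc(kind string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[kind]++
}

// Render returns the counter in text exposition format, sorted by kind.
func (c *kindCounter) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	kinds := make([]string, 0, len(c.counts))
	for k := range c.counts {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, k := range kinds {
		fmt.Fprintf(&b, "%s{kind=%q} %d\n", c.name, k, c.counts[k])
	}
	return b.String()
}

// sampleErrors returns one example error per alerting kind.
func sampleErrors() []error {
	return []error{
		fmt.Errorf("check permission: %w", context.DeadlineExceeded),
		errUnavailable,
		timeoutErr{},
	}
}

func main() {
	c := newKindCounter("yt_task_proxy_auth_infra_errors_total")
	for _, err := range sampleErrors() {
		c.Inc(classify(err))
	}
	fmt.Print(c.Render())
}
```

Because the first rule in each group sums over all `kind` values, any error counted under a specific kind also contributes to the aggregate infrastructure-failure alert.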
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,34 @@ | ||
| {{- if .Values.monitoring.enabled }} | ||
|
|
||
| {{- if eq .Values.monitoring.engine "prometheus" }} | ||
| apiVersion: monitoring.coreos.com/v1 | ||
| kind: ServiceMonitor | ||
| {{- else if eq .Values.monitoring.engine "victoriametrics" }} | ||
| apiVersion: operator.victoriametrics.com/v1beta1 | ||
| kind: VMServiceScrape | ||
| {{- else }} | ||
| {{- fail (printf "Unexpected monitoring engine: %s" .Values.monitoring.engine) }} | ||
| {{- end }} | ||
| metadata: | ||
| name: task-proxy-service-monitor | ||
| namespace: {{ .Release.Namespace }} | ||
| {{- if eq .Values.monitoring.engine "prometheus" }} | ||
| labels: | ||
| release: prometheus-stack | ||
| {{- end }} | ||
| spec: | ||
| namespaceSelector: | ||
| matchNames: | ||
| - {{ .Release.Namespace }} | ||
| selector: | ||
| matchLabels: | ||
| yt_component: task-proxy | ||
| endpoints: | ||
| - port: metrics | ||
| path: /metrics/prometheus | ||
| interval: 30s | ||
| - port: envoy-metrics | ||
| path: /stats/prometheus | ||
| interval: 30s | ||
|
|
||
| {{- end }} |
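One detail of the engine check in these templates deserves a note: Go templates do not expand `$.Values…` inside a quoted string, so a failure message that should name the offending engine has to be built with `printf`. The sketch below shows the difference using the standard `text/template` package. The `fail` function here is a hypothetical stand-in that mimics Helm's behaviour of aborting with its argument as the error, and `failMessage`, `literalTmpl`, and `printfTmpl` are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"text/template"
)

const (
	// The value path sits inside the string literal and is NOT interpolated.
	literalTmpl = `{{ fail "Unexpected monitoring engine: $.Values.monitoring.engine" }}`
	// printf formats the actual value into the message.
	printfTmpl = `{{ fail (printf "Unexpected monitoring engine: %s" .Values.monitoring.engine) }}`
)

// failMessage renders src with the given engine value and returns the
// message that the template passed to `fail`.
func failMessage(src, engine string) (string, error) {
	var got string
	funcs := template.FuncMap{
		// Hypothetical stand-in for Helm's fail: record the message and abort.
		"fail": func(msg string) (string, error) {
			got = msg
			return "", errors.New(msg)
		},
	}
	t, err := template.New("rule").Funcs(funcs).Parse(src)
	if err != nil {
		return "", err
	}
	data := map[string]interface{}{
		"Values": map[string]interface{}{
			"monitoring": map[string]interface{}{"engine": engine},
		},
	}
	// Execution is expected to fail; only the recorded message matters here.
	_ = t.Execute(io.Discard, data)
	return got, nil
}

func main() {
	for _, src := range []string{literalTmpl, printfTmpl} {
		msg, err := failMessage(src, "datadog")
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(msg)
	}
}
```

With the quoted-string form the operator sees the literal text `$.Values.monitoring.engine` in the error, which hides the misconfigured value.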