61 changes: 61 additions & 0 deletions AGENTS.md
@@ -0,0 +1,61 @@
# AGENTS.md

## Purpose

This repository contains the YTsaurus task proxy: a small Go service that discovers job-local services in YTsaurus and publishes dynamic Envoy xDS config for stable public routing with optional access checks.

## Repository Layout

- `server/main.go` - binary entrypoint. Parses flags, creates the YT client, runs periodic discovery, and serves xDS + `ext_authz` gRPC on port `9090`.
- `server/pkg/discovery.go` - discovers tasks from running operations. Supports SPYT direct submit, SPYT standalone clusters, and generic operations annotated with `task_proxy`.
- `server/pkg/auth.go` - Envoy external authorization backend. Resolves task from host or `x-yt-taskproxy-*` headers and checks YTsaurus operation read permissions.
- `server/pkg/xds.go` - builds Envoy snapshots: listeners, clusters, virtual hosts, header-based routing, optional TLS, and `ext_authz`.
- `server/pkg/updater.go` - applies the latest snapshot to the cache, refreshes auth lookup data, and writes the `services` table to YT.
- `chart/` - Helm chart for deploying the `envoy` + `server` pod.
- `examples/grpc-service/` - sample gRPC service intended to run inside YTsaurus jobs.

## Common Commands

- `make test` - runs unit tests in `server/pkg`.
- `make build` - builds the Linux `amd64` server binary into `server/server`.
- `make image RELEASE_VERSION=<version>` - builds the Docker image.
- `make helm-chart RELEASE_VERSION=<version>` - packages the Helm chart.

If `make test` fails because the Go tool cannot write to its cache in a sandboxed environment, run tests with a writable cache directory, for example:

```sh
cd server/pkg
GOCACHE=/tmp/go-build GOTMPDIR=/tmp go test ./...
```

## Runtime Model

- Envoy listens on `8080`.
- The Go service serves xDS and `ext_authz` on `9090`.
- Envoy bootstrap config is static in `chart/templates/config.yaml` and points to the local xDS server.
- Dynamic routing is generated from discovered tasks. Each task may be addressed by:
  - a hash-based subdomain
  - an alias-based subdomain when the YT operation has an alias
  - `x-yt-taskproxy-*` routing headers
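
The three addressing modes above can be sketched as a single lookup. This is a minimal illustration, not the code in `server/pkg/auth.go`: the `Task` fields, the `Index` type, the header name `x-yt-taskproxy-id`, and the host-before-header precedence are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Task is a hypothetical, simplified stand-in for a discovered task.
type Task struct {
	Hash  string // label of the hash-based subdomain
	Alias string // optional label of the alias-based subdomain
}

// Index holds lookup tables built from the latest discovery result.
// The key layout is an assumption for illustration.
type Index struct {
	bySubdomain map[string]*Task
	byHeaderID  map[string]*Task
}

// hypotheticalIDHeader is an invented name that follows the
// x-yt-taskproxy-* pattern; the real header names may differ.
const hypotheticalIDHeader = "x-yt-taskproxy-id"

// NewIndex registers every task under its hash and, if present, its alias.
func NewIndex(tasks []*Task) *Index {
	idx := &Index{bySubdomain: map[string]*Task{}, byHeaderID: map[string]*Task{}}
	for _, t := range tasks {
		idx.bySubdomain[t.Hash] = t
		idx.byHeaderID[t.Hash] = t
		if t.Alias != "" {
			idx.bySubdomain[t.Alias] = t
		}
	}
	return idx
}

// Resolve tries the host subdomain first and falls back to the routing
// header. This precedence order is an assumption, not taken from the repo.
func (idx *Index) Resolve(host, baseDomain string, headers map[string]string) (*Task, bool) {
	host = strings.ToLower(host)
	// Drop an optional ":port" suffix; hosts without a port are left as-is.
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	suffix := "." + strings.ToLower(baseDomain)
	if strings.HasSuffix(host, suffix) {
		if t, ok := idx.bySubdomain[strings.TrimSuffix(host, suffix)]; ok {
			return t, true
		}
	}
	if id, ok := headers[hypotheticalIDHeader]; ok {
		t, found := idx.byHeaderID[id]
		return t, found
	}
	return nil, false
}

func main() {
	idx := NewIndex([]*Task{{Hash: "a1b2c3", Alias: "my-spark"}})
	t, ok := idx.Resolve("my-spark.my-cluster.ytsaurus.example.net:8080",
		"my-cluster.ytsaurus.example.net", nil)
	fmt.Println(t.Hash, ok)
}
```

Whatever the real precedence is, the routing side (`xds.go`) and the authorization side (`auth.go`) need to agree on it, which is why the two files are expected to change together.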

## Change Guidelines

- Keep changes in `server/pkg/xds.go` and `server/pkg/auth.go` aligned. If you add or rename routing headers or domain formats, update both routing and authorization lookup logic.
- Discovery changes should preserve all currently supported task sources unless the change explicitly removes a scenario.
- `Task.Validate()` constrains alias-based hostnames. If you expand hostname semantics, update validation and tests together.
- When changing chart templates, keep the relationship between:
  - `chart/templates/config.yaml`
  - `chart/templates/deployment.yaml`
  - `server/pkg/const.go`
  consistent for ports, TLS mount paths, and container wiring.
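
To illustrate the kind of constraint `Task.Validate()` places on alias-based hostnames, here is a minimal sketch that treats an alias as a single DNS label. The actual rules in the repository are not shown here; the function name `validateAlias`, the RFC 1123 label rule, and the error messages are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// dnsLabel matches an RFC 1123 style label: lowercase alphanumerics and
// hyphens, with an alphanumeric character at both ends.
var dnsLabel = regexp.MustCompile(`^[a-z0-9]([a-z0-9-]*[a-z0-9])?$`)

// validateAlias is a hypothetical helper: it accepts an empty alias
// (aliases are optional) and otherwise requires a valid DNS label.
func validateAlias(alias string) error {
	if alias == "" {
		return nil
	}
	if len(alias) > 63 {
		return errors.New("alias is longer than 63 characters")
	}
	if !dnsLabel.MatchString(alias) {
		return fmt.Errorf("alias %q is not a valid DNS label", alias)
	}
	return nil
}

func main() {
	for _, alias := range []string{"my-spark", "My_Spark", "-bad"} {
		fmt.Printf("%q -> %v\n", alias, validateAlias(alias))
	}
}
```

If hostname semantics are widened (for example, to allow dots), a rule like this and its tests would need to change in the same commit.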

## Testing Expectations

- Update unit tests in `server/pkg/*_test.go` for behavior changes.
- `auth_test.go` covers request-to-task resolution precedence.
- `discovery_test.go` covers parsing of `task_proxy` annotations.
- `xds_test.go` covers the generated Envoy snapshot shape.

## Notes

- The repo may contain local, untracked artifacts during development. Do not delete unrelated files unless the task explicitly asks for cleanup.
11 changes: 11 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,11 @@
# CLAUDE.md

This repository keeps its primary agent instructions in `AGENTS.md`.

If you are an automated coding agent working in this repo:

1. Read `AGENTS.md` first.
2. Treat `AGENTS.md` as the source of truth for repository-specific guidance.
3. Use this file only as a wrapper/entrypoint for tools or agents that look specifically for `CLAUDE.md`.

For project context, commands, architecture notes, and change guidelines, see `./AGENTS.md`.
71 changes: 58 additions & 13 deletions Makefile
@@ -1,19 +1,64 @@
REPO=ghcr.io/ytsaurus
REPO := ghcr.io/ytsaurus

TARGET_OS=linux
TARGET_ARCH=amd64
TARGET_OS := linux
TARGET_ARCH := amd64

ifndef RELEASE_VERSION
RELEASE_VERSION = 0.0.0
RELEASE_VERSION ?= 0.0.0

ifdef DOCKER_CONTEXT
DOCKER_ARGS := --context $(DOCKER_CONTEXT)
else
DOCKER_ARGS :=
endif

BUILD_PLATFORM := $(TARGET_OS)/$(TARGET_ARCH)
IMAGE_TAG := $(REPO)/task-proxy:$(RELEASE_VERSION)
CHART_PACKAGE := task-proxy-chart-$(RELEASE_VERSION).tgz

.PHONY: test
test:
cd server/pkg && go test

release:
cd server && GOOS=$(TARGET_OS) GOARCH=$(TARGET_ARCH) go build -o server . && cd ..
docker build --platform $(TARGET_OS)/$(TARGET_ARCH) . -t $(REPO)/task-proxy:$(RELEASE_VERSION)
docker push $(REPO)/task-proxy:$(RELEASE_VERSION)
helm package chart
helm push task-proxy-chart-$(RELEASE_VERSION).tgz oci://$(REPO)
@echo "🧪 Running tests..."
cd server/pkg && go test ./... -v

.PHONY: build
build:
@echo "⚙️ Building binary for $(BUILD_PLATFORM)..."
cd server && \
GOOS=$(TARGET_OS) GOARCH=$(TARGET_ARCH) \
go build -o server .
@echo "✅ Binary built: server/server"

.PHONY: image
image: build test
@echo "🐳 Building Docker image: $(IMAGE_TAG)..."
docker $(DOCKER_ARGS) build \
--platform $(BUILD_PLATFORM) \
-t $(IMAGE_TAG) \
.
@echo "✅ Image built: $(IMAGE_TAG)"

.PHONY: helm-chart
helm-chart: image
@echo "📦 Packaging Helm chart version $(RELEASE_VERSION)..."
helm package chart \
--version $(RELEASE_VERSION) \
--app-version $(RELEASE_VERSION) \
--destination .
@echo "✅ Chart packaged: $(CHART_PACKAGE)"

.PHONY: release
release: helm-chart
@echo "🚀 Performing release version $(RELEASE_VERSION)..."
@echo " → Pushing Docker image..."
docker $(DOCKER_ARGS) push $(IMAGE_TAG)
@echo " → Pushing Helm chart..."
helm push $(CHART_PACKAGE) oci://$(REPO)
@echo "✅ Release completed: $(RELEASE_VERSION)"

.PHONY: clean
clean:
@echo "🗑️ Cleaning up artifacts..."
rm -f server/server $(CHART_PACKAGE)
@echo "✅ Cleanup completed"

.DEFAULT_GOAL := helm-chart
80 changes: 80 additions & 0 deletions chart/templates/alerts.yaml
@@ -0,0 +1,80 @@
{{- if and .Values.monitoring.enabled .Values.monitoring.alerts.enabled }}

{{- if eq .Values.monitoring.engine "prometheus" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
{{- else if eq .Values.monitoring.engine "victoriametrics" }}
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
{{- else }}
{{- fail (printf "Unexpected monitoring engine: %s" .Values.monitoring.engine) }}
{{- end }}
metadata:
name: task-proxy-alerts
namespace: {{ .Release.Namespace }}
{{- if eq .Values.monitoring.engine "prometheus" }}
labels:
release: prometheus-stack
{{- end }}
spec:
groups:
- name: task-proxy
rules:
- alert: TaskProxyAuthorizationInfrastructureFailures
expr: sum(increase(yt_task_proxy_auth_infra_errors_total[1m])) > {{ .Values.monitoring.alerts.authInfrastructureFailuresPerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy authorization infrastructure failures exceed threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.authInfrastructureFailuresPerMinuteThreshold }} authorization infrastructure failures over the last minute.
- alert: TaskProxyAuthorizationContextDeadlineExceeded
expr: sum(increase(yt_task_proxy_auth_infra_errors_total{kind="context_deadline_exceeded"}[1m])) > {{ .Values.monitoring.alerts.authContextDeadlineExceededPerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy authorization context deadline exceeded rate exceeds threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.authContextDeadlineExceededPerMinuteThreshold }} context deadline exceeded authorization failures over the last minute.
- alert: TaskProxyAuthorizationGrpcUnavailable
expr: sum(increase(yt_task_proxy_auth_infra_errors_total{kind="grpc_unavailable"}[1m])) > {{ .Values.monitoring.alerts.authGrpcUnavailablePerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy authorization gRPC unavailable rate exceeds threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.authGrpcUnavailablePerMinuteThreshold }} gRPC unavailable authorization failures over the last minute.
- alert: TaskProxyAuthorizationConnectionTimeout
expr: sum(increase(yt_task_proxy_auth_infra_errors_total{kind="connection_timeout"}[1m])) > {{ .Values.monitoring.alerts.authConnectionTimeoutPerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy authorization connection timeout rate exceeds threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.authConnectionTimeoutPerMinuteThreshold }} connection timeout authorization failures over the last minute.
- alert: TaskProxyDiscoveryInfrastructureFailures
expr: sum(increase(yt_task_proxy_discovery_infra_errors_total[1m])) > {{ .Values.monitoring.alerts.discoveryInfrastructureFailuresPerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy discovery infrastructure failures exceed threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryInfrastructureFailuresPerMinuteThreshold }} discovery infrastructure failures over the last minute.
- alert: TaskProxyDiscoveryContextDeadlineExceeded
expr: sum(increase(yt_task_proxy_discovery_infra_errors_total{kind="context_deadline_exceeded"}[1m])) > {{ .Values.monitoring.alerts.discoveryContextDeadlineExceededPerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy discovery context deadline exceeded rate exceeds threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryContextDeadlineExceededPerMinuteThreshold }} context deadline exceeded discovery failures over the last minute.
- alert: TaskProxyDiscoveryGrpcUnavailable
expr: sum(increase(yt_task_proxy_discovery_infra_errors_total{kind="grpc_unavailable"}[1m])) > {{ .Values.monitoring.alerts.discoveryGrpcUnavailablePerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy discovery gRPC unavailable rate exceeds threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryGrpcUnavailablePerMinuteThreshold }} gRPC unavailable discovery failures over the last minute.
- alert: TaskProxyDiscoveryConnectionTimeout
expr: sum(increase(yt_task_proxy_discovery_infra_errors_total{kind="connection_timeout"}[1m])) > {{ .Values.monitoring.alerts.discoveryConnectionTimeoutPerMinuteThreshold }}
labels:
severity: warning
annotations:
summary: Task proxy discovery connection timeout rate exceeds threshold
description: Task proxy observed more than {{ .Values.monitoring.alerts.discoveryConnectionTimeoutPerMinuteThreshold }} connection timeout discovery failures over the last minute.

{{- end }}
12 changes: 7 additions & 5 deletions chart/templates/deployment.yaml
@@ -15,7 +15,7 @@ spec:
spec:
containers:
- name: envoy
image: {{ .Values.proxy.image.repository }}:{{ .Values.proxy.image.tag }}
image: {{ .Values.proxy.image.repository }}{{ if .Values.proxy.image.tag }}:{{ .Values.proxy.image.tag }}{{ end }}
args: ["-c", "/etc/envoy/envoy.yaml", "--service-cluster", "edge-proxy"]
ports:
- name: http
@@ -46,19 +46,21 @@ spec:
{{ toYaml . | nindent 10 }}
{{- end }}
- name: server
image: {{ .Values.server.image.repository }}:{{ .Values.server.image.tag }}
image: {{ .Values.server.image.repository }}{{ if .Values.server.image.tag }}:{{ .Values.server.image.tag }}{{ end }}
command: ["./server"]
args:
- "-yt-token-path=/etc/yt/token"
- "-namespace={{ .Release.Namespace }}"
- "-yt-proxy={{ default (printf "http-proxies.%s.svc.cluster.local" .Release.Namespace) .Values.ytProxy }}"
- "-base-domain={{ .Values.baseDomain }}"
- "-dir-path={{ .Values.dirPath }}"
- "-discovery-period-seconds={{ .Values.discoveryPeriodSeconds }}"
- "-auth-enabled={{ .Values.auth.enabled }}"
- "-auth-cookie-name={{ .Values.auth.cookieName }}"
ports:
- containerPort: 9090
name: http
name: grpc
- containerPort: 9102
name: metrics
volumeMounts:
- name: token
mountPath: /etc/yt
@@ -88,4 +90,4 @@ spec:
{{- with .Values.affinity }}
affinity:
{{ toYaml . | nindent 8 }}
{{- end }}
{{- end }}
34 changes: 34 additions & 0 deletions chart/templates/service-monitor.yaml
@@ -0,0 +1,34 @@
{{- if .Values.monitoring.enabled }}

{{- if eq .Values.monitoring.engine "prometheus" }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
{{- else if eq .Values.monitoring.engine "victoriametrics" }}
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
{{- else }}
{{- fail (printf "Unexpected monitoring engine: %s" .Values.monitoring.engine) }}
{{- end }}
metadata:
name: task-proxy-service-monitor
namespace: {{ .Release.Namespace }}
{{- if eq .Values.monitoring.engine "prometheus" }}
labels:
release: prometheus-stack
{{- end }}
spec:
namespaceSelector:
matchNames:
- {{ .Release.Namespace }}
selector:
matchLabels:
yt_component: task-proxy
endpoints:
- port: metrics
path: /metrics/prometheus
interval: 30s
- port: envoy-metrics
path: /stats/prometheus
interval: 30s

{{- end }}
12 changes: 11 additions & 1 deletion chart/templates/service.yaml
@@ -3,6 +3,8 @@ kind: Service
metadata:
name: {{ .Release.Name }}
namespace: {{ .Release.Namespace }}
labels:
yt_component: task-proxy
spec:
type:
{{- if .Values.tls.enabled }}
@@ -20,4 +22,12 @@ spec:
443
{{- else }}
80
{{- end }}
{{- end }}
{{- if .Values.monitoring.enabled }}
- name: metrics
port: 9102
targetPort: 9102
- name: envoy-metrics
port: 9901
targetPort: 9901
{{- end }}
23 changes: 21 additions & 2 deletions chart/values.yaml
@@ -4,6 +4,10 @@ baseDomain: my-cluster.ytsaurus.example.net

tokenSecretRef: task-proxy-token

# YTsaurus HTTP proxy host. If empty, chart will use
# http-proxies.<release-namespace>.svc.cluster.local
ytProxy: ""

dirPath: //sys/task_proxies

discoveryPeriodSeconds: 60
@@ -17,17 +21,32 @@ tls:
certSecretRef: yt-domain-cert

proxy:
image:
image:
repository: envoyproxy/envoy
tag: v1.36-latest
resources: {}

server:
image:
image:
repository: ghcr.io/ytsaurus/task-proxy
tag: ""
resources: {}

nodeSelector: {}

affinity: {}

# prometheus/victoriametrics monitoring
monitoring:
enabled: false
engine: prometheus # or victoriametrics
alerts:
enabled: true
authInfrastructureFailuresPerMinuteThreshold: 10
authContextDeadlineExceededPerMinuteThreshold: 10
authGrpcUnavailablePerMinuteThreshold: 10
authConnectionTimeoutPerMinuteThreshold: 10
discoveryInfrastructureFailuresPerMinuteThreshold: 10
discoveryContextDeadlineExceededPerMinuteThreshold: 10
discoveryGrpcUnavailablePerMinuteThreshold: 10
discoveryConnectionTimeoutPerMinuteThreshold: 10