Jobs to run Apps Codegen bulk gen + Evals #4103
Open
keugenek wants to merge 17 commits into main from exprimental/mcp-evals-nightly.

Commits (17, all authored by keugenek):
- `990c46e` Skeleton for nightly run of Apps Codegen Evals
- `5a42d63` Fix Python version compatibility for serverless compute
- `8c1e665` Proper url for evals repo
- `bdf3d35` Simplify eval runner - clone repo and run evaluation
- `4d58e99` Merge branch 'main' into exprimental/mcp-evals-nightly
- `6602986` Add app generation job using CLI as MCP server
- `ea20bd0` Fix UC Volume paths for CLI binary and generated apps
- `3aa5dd8` Fix app generation: use LiteLLM backend + fix UC Volume symlinks
- `61be7e7` Fix eval job CLI parameter parsing
- `d4cfd37` Merge remote-tracking branch 'origin/main' into exprimental/mcp-evals…
- `52536b5` Merge branch 'main' into exprimental/mcp-evals-nightly
- `cf78b42` Required to bypass proc mount restrictions and AppArmor.
- `3d5b131` Refactor get_prompts function to use external import for prompt retri…
- `b08f318` Update README.md for Apps-MCP Evals: Enhance documentation to clarify…
- `7c04633` Merge branch 'main' into exprimental/mcp-evals-nightly
- `b2cf28c` Refactor eval setup and runner: Remove Docker installation and manage…
- `6df3873` Add local evaluation functionality: Implement run_local_evaluation fu…
`experimental/apps-mcp/evals/README.md`
# Apps-MCP Evals

Databricks Asset Bundle for generating and evaluating apps using the Apps-MCP system with the klaudbiusz framework.

## Overview

This bundle provides two jobs:
1. **Generation Job** - Generates apps using klaudbiusz with the Databricks CLI as MCP server
2. **Evaluation Job** - Evaluates generated apps and logs results to MLflow
## Prerequisites

1. **Databricks Secrets** - Create secret scope and add tokens:
   ```bash
   databricks secrets create-scope apps-mcp-evals
   databricks secrets put-secret apps-mcp-evals anthropic-api-key
   databricks secrets put-secret apps-mcp-evals databricks-token
   ```

2. **UC Volumes** - Create volumes for artifacts:
   ```bash
   databricks volumes create main.default.apps_mcp_artifacts
   databricks volumes create main.default.apps_mcp_generated
   ```

3. **CLI Binary** - Build and upload Linux CLI binary:
   ```bash
   GOOS=linux GOARCH=amd64 go build -o databricks-linux
   databricks fs cp databricks-linux /Volumes/main/default/apps_mcp_artifacts/
   ```
## Quick Start

```bash
cd experimental/apps-mcp/evals

# Validate bundle
databricks bundle validate -t dev

# Deploy
databricks bundle deploy -t dev

# Run generation (creates apps in UC Volume)
databricks bundle run -t dev apps_generation_job

# Run evaluation (evaluates apps, logs to MLflow)
databricks bundle run -t dev apps_eval_job
```
## Jobs

### Generation Job (`apps_generation_job`)

Generates apps using klaudbiusz's local_run with the LiteLLM backend.

**Parameters:**
- `prompts` - Prompt set: `databricks`, `databricks_v2`, or `test` (default: `test`)
- `cli_binary_volume` - Path to CLI binary volume
- `apps_volume` - Output volume for generated apps

All three map to bundle variables of the same name; see the override sketch below.

**Cluster:** Jobs cluster with Spark 16.2.x (Python 3.12)
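These defaults come from bundle variables, so a one-off run with a different prompt set just redeploys with an override. A minimal sketch, assuming the `--var` flag of the `databricks bundle` commands; the chosen prompt set is only an example:

```bash
# Redeploy the dev target with a different prompt set, then trigger generation.
databricks bundle deploy -t dev --var="prompts=databricks_v2"
databricks bundle run -t dev apps_generation_job
```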
### Evaluation Job (`apps_eval_job`)

Evaluates generated apps using klaudbiusz's Docker-based evaluation.

**Parameters:**
- `apps_volume` - Volume containing apps to evaluate
- `mlflow_experiment` - MLflow experiment for logging results
- `parallelism` - Number of parallel evaluations

**Cluster:** Jobs cluster with Spark 16.2.x, Docker installed via init script

**Schedule:** Nightly at 2am UTC
## Configuration

### Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `prompts` | Prompt set for generation | `test` |
| `cli_binary_volume` | UC Volume for CLI binary | `/Volumes/main/default/apps_mcp_artifacts` |
| `apps_volume` | UC Volume for generated apps | `/Volumes/main/default/apps_mcp_generated` |
| `mlflow_experiment` | MLflow experiment path | `/Shared/apps-mcp-evaluations` |
| `eval_parallelism` | Parallel eval workers | `4` |
| `evals_git_url` | klaudbiusz repo URL | `https://github.com/neondatabase/appdotbuild-agent.git` |

### Targets

- **dev** - Development mode, staging MLflow experiment
- **prod** - Production mode, service principal identity (see the deployment sketch below)
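Promoting the bundle to the prod target mirrors the dev workflow from Quick Start. A hedged sketch; it assumes the deploying identity is allowed to run jobs as the `apps-mcp-eval-sp` service principal configured in `databricks.yml`:

```bash
# Validate and deploy the production target, then trigger one eval run manually.
databricks bundle validate -t prod
databricks bundle deploy -t prod
databricks bundle run -t prod apps_eval_job
```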
## Monitoring

- **MLflow** - View metrics at the configured experiment path
- **Health Alerts** - Eval job alerts if runtime exceeds 2 hours
- **Logs** - Check job run output for detailed evaluation results (see the sketch below)
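Job runs can also be checked from a terminal. A rough sketch, assuming the `databricks jobs list` and `databricks jobs list-runs` commands of the current CLI are available; the name filter and `<JOB_ID>` placeholder are illustrative:

```bash
# Locate the deployed eval job, then list its recent runs.
databricks jobs list | grep "Apps-MCP Continuous Evals"
databricks jobs list-runs --job-id <JOB_ID>
```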
## Architecture

```
evals/
├── databricks.yml                    # Bundle configuration
├── resources/
│   ├── apps_generation_job.job.yml   # Generation job
│   └── apps_eval_job.job.yml         # Evaluation job
├── init/
│   ├── setup_generation.sh           # Generation cluster init
│   └── setup_eval.sh                 # Eval cluster init (Docker)
├── src/
│   ├── generate_apps.py              # App generation orchestrator
│   └── run_evals.py                  # Evaluation orchestrator
└── pyproject.toml                    # Python package config
```
## Prompt Sets

Available prompt sets (configured via the `prompts` variable):

- `test` - Simple test prompts (1 app) for quick validation
- `databricks` - 5 Databricks-focused dashboard prompts
- `databricks_v2` - 20 realistic human-style prompts
## Known Limitations

- Docker containers require the `--privileged` flag on Databricks clusters
- Generation uses the LiteLLM backend (the Claude Agent SDK has a root-user restriction)
- UC Volumes don't support symlinks, so a `latest.txt` file is used instead
`experimental/apps-mcp/evals/databricks.yml`

```yaml
# Databricks Asset Bundle for Apps-MCP Continuous Evals
# See https://docs.databricks.com/dev-tools/bundles/index.html
bundle:
  name: apps-mcp-evals
  uuid: 80e50a10-c2da-4b59-99d6-e101b1bcf485

include:
  - resources/*.yml

artifacts:
  apps_mcp_evals:
    type: whl
    build: uv build --wheel
    path: .

variables:
  catalog:
    description: Unity Catalog for eval results
    default: main
  schema:
    description: Schema for eval tables
  mlflow_experiment:
    description: MLflow experiment path for tracking
    default: /Shared/apps-mcp-evaluations
  evals_git_url:
    description: Git URL for appdotbuild-agent eval framework
    default: https://github.com/neondatabase/appdotbuild-agent.git
  eval_parallelism:
    description: Number of parallel eval workers
    default: "4"
  cli_binary_volume:
    description: UC Volume path for CLI binary
    default: /Volumes/main/default/apps_mcp_artifacts
  apps_volume:
    description: UC Volume path for generated apps
    default: /Volumes/main/default/apps_mcp_generated
  generation_parallelism:
    description: Number of parallel app generations
    default: "4"
  prompts:
    description: Prompt set for generation (databricks, databricks_v2, test)
    default: test

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://6177827686947384.4.gcp.databricks.com
    variables:
      schema: ${workspace.current_user.short_name}
      mlflow_experiment: /Shared/apps-mcp-evaluations-staging

  prod:
    mode: production
    workspace:
      host: https://6177827686947384.4.gcp.databricks.com
      root_path: /Workspace/Users/${workspace.current_user.user_name}/.bundle/${bundle.name}/${bundle.target}
    variables:
      schema: evals
    run_as:
      service_principal_name: apps-mcp-eval-sp
```
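Since the dev target above is marked `default: true`, the `-t dev` flag in the README commands can be dropped; for example:

```bash
# With no -t flag, bundle commands resolve to the dev target.
databricks bundle validate
databricks bundle deploy
```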
`experimental/apps-mcp/evals/init/setup_eval.sh`

```bash
#!/bin/bash
set -e

echo "=== Apps-MCP Eval Setup ==="
echo "Python version: $(python --version)"

# Install Node.js (required for local npm install/build/test)
echo "Installing Node.js..."
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs

echo "Node version: $(node --version)"
echo "npm version: $(npm --version)"

# Install Python dependencies
pip install fire mlflow

echo "=== Setup complete ==="
```
`experimental/apps-mcp/evals/init/setup_generation.sh`

```bash
#!/bin/bash
set -e

echo "=== Setting up generation environment ==="

# Install Dagger (required for klaudbiusz container orchestration)
echo "Installing Dagger..."
curl -fsSL https://dl.dagger.io/dagger/install.sh | sh
export PATH=$PATH:/root/.local/bin

# Install Python dependencies for klaudbiusz
echo "Installing Python dependencies..."
pip install --quiet dagger-io fire tqdm python-dotenv claude-agent-sdk litellm joblib tenacity

echo "=== Setup complete ==="
```
`experimental/apps-mcp/evals/pyproject.toml`

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src"]

[project]
name = "apps_mcp_evals"
version = "0.1.0"
description = "Continuous evaluation framework for Apps-MCP code generation"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "mlflow>=2.15.0",
    "fire>=0.7.1",
]

[project.scripts]
main = "src.run_evals:cli"

[tool.ruff]
line-length = 120
target-version = "py310"
```
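The bundle's `artifacts` entry in `databricks.yml` builds this package with `uv build --wheel`; the same command can be run locally to sanity-check packaging. A sketch, assuming `uv` is installed; the exact wheel filename follows from the `name` and `version` fields above:

```bash
# Build the wheel locally, as the bundle artifact step does.
cd experimental/apps-mcp/evals
uv build --wheel
ls dist/   # expect something like apps_mcp_evals-0.1.0-py3-none-any.whl
```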
`experimental/apps-mcp/evals/resources/apps_eval_job.job.yml` (68 additions)

```yaml
# Apps-MCP Evaluation Job
# Runs nightly + supports manual trigger via: databricks bundle run -t dev apps_eval_job

resources:
  jobs:
    apps_eval_job:
      name: "[${bundle.target}] Apps-MCP Continuous Evals"

      # Nightly schedule (2am UTC)
      trigger:
        periodic:
          interval: 1
          unit: DAYS

      # Health monitoring - alert if eval takes > 2 hours
      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: 7200

      email_notifications:
        on_failure:
          - apps-mcp-team@databricks.com

      parameters:
        - name: mlflow_experiment
          default: ${var.mlflow_experiment}
        - name: parallelism
          default: ${var.eval_parallelism}
        - name: evals_git_url
          default: ${var.evals_git_url}
        - name: apps_volume
          default: ${var.apps_volume}

      job_clusters:
        - job_cluster_key: eval_cluster
          new_cluster:
            spark_version: "16.2.x-scala2.12"
            node_type_id: "n2-standard-4"
            num_workers: 0
            data_security_mode: SINGLE_USER
            spark_conf:
              spark.databricks.cluster.profile: singleNode
              spark.master: "local[*]"
            custom_tags:
              ResourceClass: SingleNode
            spark_env_vars:
              DATABRICKS_HOST: ${workspace.host}
              DATABRICKS_TOKEN: "{{secrets/apps-mcp-evals/databricks-token}}"
```

> Review comment (Contributor), on the `DATABRICKS_TOKEN` line: I don't think PAT is necessary here

```yaml
            init_scripts:
              - workspace:
                  destination: ${workspace.file_path}/init/setup_eval.sh

      tasks:
        - task_key: run_evals
          job_cluster_key: eval_cluster
          spark_python_task:
            python_file: ${workspace.file_path}/src/run_evals.py
            parameters:
              - --mlflow-experiment
              - ${var.mlflow_experiment}
              - --parallelism
              - ${var.eval_parallelism}
              - --evals-git-url
              - ${var.evals_git_url}
              - --apps-volume
              - ${var.apps_volume}
```
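For debugging outside Databricks, the task above reduces to a single script invocation. A rough local equivalent, assuming `run_evals.py` accepts exactly the flags shown in the task definition and using the default values from `databricks.yml`:

```bash
# Approximate local equivalent of the run_evals task (defaults shown; adjust as needed).
python src/run_evals.py \
  --mlflow-experiment /Shared/apps-mcp-evaluations \
  --parallelism 4 \
  --evals-git-url https://github.com/neondatabase/appdotbuild-agent.git \
  --apps-volume /Volumes/main/default/apps_mcp_generated
```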
`experimental/apps-mcp/evals/resources/apps_generation_job.job.yml` (39 additions)

```yaml
resources:
  jobs:
    apps_generation_job:
      name: "[${bundle.target}] Apps-MCP Generation"

      job_clusters:
        - job_cluster_key: generation_cluster
          new_cluster:
            spark_version: "16.2.x-scala2.12"
            node_type_id: "n2-standard-8"
            num_workers: 0
            data_security_mode: SINGLE_USER
            spark_conf:
              spark.databricks.cluster.profile: singleNode
              spark.master: "local[*]"
            custom_tags:
              ResourceClass: SingleNode
            spark_env_vars:
              ANTHROPIC_API_KEY: "{{secrets/apps-mcp-evals/anthropic-api-key}}"
              DATABRICKS_HOST: ${workspace.host}
              DATABRICKS_TOKEN: "{{secrets/apps-mcp-evals/databricks-token}}"
            init_scripts:
              - workspace:
                  destination: ${workspace.file_path}/init/setup_generation.sh

      tasks:
        - task_key: generate_apps
          job_cluster_key: generation_cluster
          spark_python_task:
            python_file: ${workspace.file_path}/src/generate_apps.py
            parameters:
              - --mcp-binary
              - ${var.cli_binary_volume}/databricks-linux
              - --output-volume
              - ${var.apps_volume}
              - --prompts
              - ${var.prompts}
              - --max-concurrency
              - ${var.generation_parallelism}
```
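Similarly, the generation task is one script call, but here the credentials come from the cluster's `spark_env_vars`. A rough local equivalent, assuming `generate_apps.py` accepts the flags above; the exported values are illustrative:

```bash
# Approximate local equivalent of the generate_apps task.
export ANTHROPIC_API_KEY=...                     # stored in the apps-mcp-evals secret scope
export DATABRICKS_HOST=https://<workspace-host>
python src/generate_apps.py \
  --mcp-binary /Volumes/main/default/apps_mcp_artifacts/databricks-linux \
  --output-volume /Volumes/main/default/apps_mcp_generated \
  --prompts test \
  --max-concurrency 4
```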
A new one-line Python module containing only a docstring:

```python
"""Apps-MCP Evaluation Framework."""
```
> Review comment: remove