Skip to content

Refactor deployment workflows to auto-discover shared replica ALB URL#296

Merged
jfrench9 merged 3 commits intomainfrom
refactor/improved-deployment
Feb 5, 2026
Merged

Refactor deployment workflows to auto-discover shared replica ALB URL#296
jfrench9 merged 3 commits intomainfrom
refactor/improved-deployment

Conversation

@jfrench9
Copy link
Member

@jfrench9 jfrench9 commented Feb 5, 2026

Summary

This refactor modernizes our deployment infrastructure by replacing manual ALB URL configuration with automatic CloudFormation-based discovery. The changes streamline deployment workflows and enhance system scalability for read operations.

Key Accomplishments

  • Simplified Configuration Management: Eliminated manual shared-replica-alb-url parameters from all deployment workflows (API, Dagster, Graph, production, and staging)
  • Enhanced Environment Configuration: Implemented intelligent CloudFormation lookup mechanism in the environment configuration module to automatically resolve shared replica ALB URLs
  • Improved Security Group Resolution: Refactored API Security Group resolution logic in deployment workflows for better maintainability
  • Reduced CloudFormation Complexity: Removed hardcoded ALB URL parameters from CloudFormation templates, making them more flexible and environment-agnostic

Breaking Changes

⚠️ Configuration Changes Required:

  • Manual shared-replica-alb-url workflow inputs are no longer supported
  • Deployment processes now rely on CloudFormation stack lookups for ALB URL resolution
  • Existing deployment scripts or external tools referencing the removed parameters will need updates

Infrastructure Considerations

  • Zero Downtime: Changes maintain backward compatibility during the transition period
  • CloudFormation Dependencies: Deployments now require proper CloudFormation stack naming conventions for automatic discovery
  • Enhanced Scalability: Auto-discovery mechanism supports dynamic scaling of read replica infrastructure
  • Reduced Manual Errors: Eliminates potential configuration drift from manual URL management

Testing Notes

  • Verify CloudFormation stack naming follows expected patterns for lookup functionality
  • Test deployment workflows in staging environment to ensure proper ALB URL resolution
  • Validate that existing read operations continue to function correctly with auto-discovered URLs
  • Confirm security group resolution works as expected in all target environments

🤖 Generated with Claude Code

Branch Info:

  • Source: refactor/improved-deployment
  • Target: main
  • Type: refactor

Co-Authored-By: Claude noreply@anthropic.com

… and CloudFormation templates. Implement auto-discovery of the shared replica ALB URL in the environment configuration using CloudFormation lookups. This change simplifies the deployment process by eliminating manual inputs and enhances scalability for read operations.
- Simplified the process of determining the API Security Group ID by removing environment-based conditional logic.
- Updated the security group lookup to directly use the environment name in the filter, enhancing clarity and maintainability.
- Removed redundant echo statements related to stack names, streamlining the script.
@jfrench9
Copy link
Member Author

jfrench9 commented Feb 5, 2026

@claude please review this PR

@claude
Copy link

claude bot commented Feb 5, 2026

Claude finished @jfrench9's task in 1m 34s —— View job


Code Review Complete ✅

I've completed a thorough review of this refactor. This is a well-executed modernization of the deployment infrastructure. Below are my findings:


✅ Strengths

1. Excellent Architecture Decision

The move from manual configuration to CloudFormation-based auto-discovery is the right approach. This eliminates configuration drift and reduces deployment complexity.

2. Clean Implementation

The _get_shared_replica_alb_url_from_cloudformation() function in env.py (lines 269-318) is well-designed:

  • Proper caching to avoid repeated API calls
  • Graceful fallback when boto3 isn't available (local dev)
  • Environment-aware (only runs in prod/staging)
  • Follows consistent naming conventions

3. Backward Compatible During Transition

The implementation uses a fallback pattern:

SHARED_REPLICA_ALB_URL = (
    get_str_env("SHARED_REPLICA_ALB_URL", "")
    or _get_shared_replica_alb_url_from_cloudformation()
)

This allows explicit env var override if needed, then auto-discovery.

4. Consistent Refactoring

All 5 workflow files correctly removed the manual parameter inputs, keeping the changes uniform.


⚠️ Issues Found

1. Missing Error Handling (Medium Priority)

Location: robosystems/config/env.py:269-318

The CloudFormation lookup silently returns empty string on all errors. This could hide legitimate problems during deployment.

Issue:

except Exception:
    # Any other error - fail silently and return empty string
    _cloudformation_cache[cache_key] = None
    return ""

Recommendation: Add logging to help diagnose issues:

except Exception as e:
    # Log the error for debugging (use print to avoid circular import)
    print(f"Warning: Failed to lookup shared replica ALB URL from CloudFormation: {e}")
    _cloudformation_cache[cache_key] = None
    return ""

2. Potential Runtime Failure Path (High Priority)

Location: robosystems/graph_api/client/factory.py:457-479

The routing logic depends on SHARED_REPLICA_ALB_URL being populated, but if CloudFormation lookup fails silently AND SHARED_MASTER_READS_ENABLED is false, read operations will fail at runtime.

Current behavior: Application starts successfully but fails when attempting shared repository reads.

Recommendation: Add a startup validation check in robosystems/config/validation.py (referenced in env.py:165 comments):

if env.is_aws_environment() and GraphTypeRegistry.SHARED_REPOSITORIES:
    if not env.SHARED_REPLICA_ALB_URL and not env.SHARED_MASTER_READS_ENABLED:
        errors.append(
            "Shared repositories are configured but no read endpoint available. "
            "Verify CloudFormation stack 'RoboSystemsGraphSharedReplicas{Prod|Staging}' "
            "exists with 'ALBEndpoint' output, or enable SHARED_MASTER_READS_ENABLED."
        )

3. Hardcoded Stack Naming Convention (Low Priority)

Location: robosystems/config/env.py:292-293

The stack name pattern is hardcoded:

env_suffix = "Prod" if environment == "prod" else "Staging"
stack_name = f"RoboSystemsGraphSharedReplicas{env_suffix}"

Risk: If stack naming changes in the future, this breaks silently.

Recommendation: Consider making this configurable via an environment variable with the current value as default, or document the naming convention dependency in comments.

4. Missing Test Coverage (Medium Priority)

Observation: No tests found for the new CloudFormation lookup functionality.

Recommendation: Add unit tests in /tests/config/ to verify:

  • Successful CloudFormation lookup
  • Proper caching behavior
  • Graceful fallback on errors
  • Environment filtering (dev vs prod/staging)

📋 Deployment Checklist

Before merging, verify:

  1. CloudFormation Stack Exists: Ensure RoboSystemsGraphSharedReplicas{Prod|Staging} stacks have been deployed with the ALBEndpoint output
  2. Staging Test: Deploy to staging and verify:
    # Via bastion tunnel
    just bastion-tunnel staging api
    curl http://localhost:8000/health
    # Check logs for CloudFormation lookup success
    just logs api | grep "shared replica"
  3. Fallback Path: If replica stacks don't exist yet, ensure SHARED_MASTER_READS_ENABLED=true is set

🔍 Security Review

No security concerns identified:

  • CloudFormation API calls use existing IAM role permissions
  • No sensitive data exposure
  • Caching doesn't persist sensitive information
  • Graceful failure modes don't leak information

📊 Breaking Changes Assessment

The PR description accurately identifies breaking changes. Additional considerations:

  1. Rollback Plan: If issues occur post-deployment, the manual parameter can be re-added temporarily and set via environment variable override
  2. External Tools: Any CI/CD pipelines or external automation referencing the removed workflow parameters will need updates
  3. Documentation: The CLAUDE.md file doesn't mention this auto-discovery feature - consider adding a note about how shared replica URLs are resolved

💡 Recommendations

Priority: High

  • Add error logging to CloudFormation lookup function (env.py:315)
  • Add startup validation for shared repository read endpoint availability

Priority: Medium

  • Add unit tests for CloudFormation lookup logic
  • Update env.py docstring (line 1-19) to mention auto-discovery for SHARED_REPLICA_ALB_URL

Priority: Low

  • Document stack naming convention dependency
  • Consider making stack name configurable for maximum flexibility

✅ Approval Status

Recommendation: Approve with minor improvements

This is solid work that meaningfully improves the deployment process. The issues identified are mostly about defensive programming and observability rather than fundamental flaws. The refactor can be merged after addressing the high-priority error handling items.

The architecture is sound, the implementation is clean, and the approach aligns well with the infrastructure-as-code philosophy evident throughout the codebase.


Files Reviewed:

  • .github/workflows/deploy-api.yml
  • .github/workflows/deploy-dagster.yml
  • .github/workflows/deploy-graph.yml
  • .github/workflows/prod.yml
  • .github/workflows/staging.yml
  • cloudformation/api.yaml
  • robosystems/config/env.py

…ation

- Updated exception handling to print a warning message when the lookup fails, providing better visibility into potential issues during the process.
- Removed the comment indicating the absence of boto3, streamlining the code for clarity.
@jfrench9 jfrench9 merged commit 90ed7b9 into main Feb 5, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant