Merged
2 changes: 1 addition & 1 deletion .github/workflows/CD_production.yml
@@ -88,7 +88,7 @@ jobs:
         run: |
           export MAX_INSTANCES="10"
           export SERVICE_NAME="ocotillo-api"
-          export ENTRYPOINT="gunicorn -w 1 -k uvicorn.workers.UvicornWorker main:app"
+          export ENTRYPOINT="gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app"
           export MIN_INSTANCES="0"
Comment on lines 89 to 92

Copilot AI Apr 15, 2026

Bumping Gunicorn workers from 1 to 4 increases process count and can significantly raise DB connection usage (each worker typically maintains its own pool) and memory footprint. Consider making the worker count configurable and confirm Cloud SQL connection limits/pool settings are aligned with this change before rolling out to production.

           envsubst < .github/app.template.yaml > app.yaml

2 changes: 1 addition & 1 deletion .github/workflows/CD_staging.yml
@@ -88,7 +88,7 @@ jobs:
         run: |
           export MAX_INSTANCES="10"
           export SERVICE_NAME="ocotillo-api-staging"
-          export ENTRYPOINT="gunicorn -w 1 -k uvicorn.workers.UvicornWorker main:app"
+          export ENTRYPOINT="gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app"
           export MIN_INSTANCES="0"
Comment on lines 89 to 92

Copilot AI Apr 15, 2026

Bumping Gunicorn workers from 1 to 4 increases process count and typically multiplies DB connections/pool usage (each worker has its own SQLAlchemy engine/pool) and memory. Consider making the worker count configurable via an env var (with a safe default) and verify Cloud SQL connection limits/pool sizing are compatible with the increased concurrency.
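
To put rough numbers on the connection concern, here is a back-of-the-envelope sketch. It assumes each Gunicorn worker owns one SQLAlchemy engine with the library's default `QueuePool` settings (`pool_size=5`, `max_overflow=10`); the diff does not show the repo's actual pool configuration, so treat those values as placeholders. `GUNICORN_WORKERS` is a hypothetical variable name used only to illustrate the "configurable with a safe default" suggestion.

```python
import os


def max_db_connections(instances: int, workers: int,
                       pool_size: int = 5, max_overflow: int = 10) -> int:
    """Upper bound on open connections if every worker fills its own pool."""
    return instances * workers * (pool_size + max_overflow)


def worker_count(env=os.environ, default: int = 4) -> int:
    """Read the hypothetical GUNICORN_WORKERS variable, falling back to a default."""
    return int(env.get("GUNICORN_WORKERS", default))


# MAX_INSTANCES="10" comes from the workflow; pool values are assumed defaults.
before = max_db_connections(instances=10, workers=1)
after = max_db_connections(instances=10, workers=4)
print(before, after)  # 150 600
```

Under these assumptions the worst case grows fourfold, which is the number to compare against the Cloud SQL instance's connection limit.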

           envsubst < .github/app.template.yaml > app.yaml

157 changes: 157 additions & 0 deletions .github/workflows/CD_testing.yml
@@ -0,0 +1,157 @@
name: CD (Testing)

on:
  push:
    branches: [jir*]

permissions:
  contents: write

jobs:
  testing-deploy:

    runs-on: ubuntu-latest
    environment: staging

    steps:
      - name: Check out source repository
        uses: actions/checkout@v6.0.2
        with:
          fetch-depth: 0

      - name: Install uv in container
        uses: astral-sh/setup-uv@v8.0.0
        with:
          version: "latest"

      - name: Generate requirements.txt
        run: |
          uv export \
            --format requirements-txt \
            --no-emit-project \
            --no-dev \
            --output-file requirements.txt

      - name: Authenticate to Google Cloud
        uses: 'google-github-actions/auth@v3'
        with:
          credentials_json: ${{ secrets.CLOUD_DEPLOY_SERVICE_ACCOUNT_KEY }}

      - name: Run Alembic migrations on staging database
        env:
          DB_DRIVER: "cloudsql"
          CLOUD_SQL_INSTANCE_NAME: "${{ secrets.CLOUD_SQL_INSTANCE_NAME }}"
          CLOUD_SQL_DATABASE: "${{ vars.CLOUD_SQL_DATABASE }}"
Comment on lines +40 to +44

P1: Isolate testing deploy from staging schema migrations

This workflow runs on every jir* branch push but still executes uv run alembic upgrade head with staging credentials (environment: staging and CLOUD_SQL_* values), so an unmerged feature branch can apply schema changes to the shared staging database. In branches that contain migrations, this can break staging for other users before code review/merge; this job should target an isolated testing database/environment or skip migrations.


          CLOUD_SQL_USER: "${{ secrets.CLOUD_SQL_USER }}"
          CLOUD_SQL_IAM_AUTH: true
        run: |
          uv run alembic upgrade head

      - name: Refresh materialized views on staging database
        env:
          DB_DRIVER: "cloudsql"
          CLOUD_SQL_INSTANCE_NAME: "${{ secrets.CLOUD_SQL_INSTANCE_NAME }}"
          CLOUD_SQL_DATABASE: "${{ vars.CLOUD_SQL_DATABASE }}"
          CLOUD_SQL_USER: "${{ secrets.CLOUD_SQL_USER }}"
          CLOUD_SQL_IAM_AUTH: true
        run: |
          uv run python -m cli.cli refresh-pygeoapi-materialized-views

      - name: Ensure envsubst is available
        run: |
          if ! command -v envsubst >/dev/null 2>&1; then
            sudo apt-get update
            sudo apt-get install -y gettext-base
          fi

      - name: Render App Engine configs
        env:
          ENVIRONMENT: "staging"
          CLOUD_SQL_INSTANCE_NAME: "${{ secrets.CLOUD_SQL_INSTANCE_NAME }}"
          CLOUD_SQL_DATABASE: "${{ vars.CLOUD_SQL_DATABASE }}"
          CLOUD_SQL_USER: "${{ secrets.CLOUD_SQL_USER }}"
          PYGEOAPI_POSTGRES_DB: "${{ vars.CLOUD_SQL_DATABASE }}"
          PYGEOAPI_POSTGRES_USER: "${{ secrets.PYGEOAPI_POSTGRES_USER }}"
          PYGEOAPI_POSTGRES_HOST: "${{ vars.PYGEOAPI_POSTGRES_HOST || '127.0.0.1' }}"
          PYGEOAPI_POSTGRES_PORT: "${{ vars.PYGEOAPI_POSTGRES_PORT || '5432' }}"
          PYGEOAPI_POSTGRES_PASSWORD: "${{ secrets.PYGEOAPI_POSTGRES_PASSWORD }}"
          PYGEOAPI_SERVER_URL: "${{ vars.PYGEOAPI_SERVER_URL }}"
          CLOUD_SQL_IAM_AUTH: "true"
          GCS_SERVICE_ACCOUNT_KEY: "${{ secrets.GCS_SERVICE_ACCOUNT_KEY }}"
          GCS_BUCKET_NAME: "${{ vars.GCS_BUCKET_NAME }}"
          AUTHENTIK_URL: "${{ vars.AUTHENTIK_URL }}"
          AUTHENTIK_CLIENT_ID: "${{ vars.AUTHENTIK_CLIENT_ID }}"
          AUTHENTIK_AUTHORIZE_URL: "${{ vars.AUTHENTIK_AUTHORIZE_URL }}"
          AUTHENTIK_TOKEN_URL: "${{ vars.AUTHENTIK_TOKEN_URL }}"
          SESSION_SECRET_KEY: "${{ secrets.SESSION_SECRET_KEY }}"
          APITALLY_CLIENT_ID: "${{ vars.APITALLY_CLIENT_ID }}"
        run: |
          export MAX_INSTANCES="10"
          export SERVICE_NAME="ocotillo-api-testing"
          export ENTRYPOINT="gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app"
          export MIN_INSTANCES="0"
          envsubst < .github/app.template.yaml > app.yaml

      - name: Deploy to Google Cloud
        run: |
          gcloud app deploy \
            app.yaml \
            --quiet \
            --project ${{ vars.GCP_PROJECT_ID }}

      - name: Clean up oldest versions
        run: |
          SERVICE="ocotillo-api-testing"
          VERSIONS_JSON="$(gcloud app versions list --service="$SERVICE" --project=${{ vars.GCP_PROJECT_ID }} --format=json --sort-by="version.createTime" 2>/dev/null || printf '[]')"
          export VERSIONS_JSON
          DELETE_VERSION="$(python - <<'PY'
          import json
          import os

          versions = json.loads(os.environ.get("VERSIONS_JSON", "[]") or "[]")
          if len(versions) <= 1:
              print("")
              raise SystemExit(0)

          def traffic_split(version):
              for key in ("traffic_split", "trafficSplit"):
                  value = version.get(key)
                  if value is not None:
                      try:
                          return float(value)
                      except (TypeError, ValueError):
                          return 0.0
              return 0.0

          for version in versions:
              if traffic_split(version) == 0.0:
                  print(version.get("id", ""))
                  break
          else:
              print("")
          PY
          )"
          if [ -n "$DELETE_VERSION" ]; then
            echo "Deleting old non-serving version for $SERVICE: $DELETE_VERSION"
            gcloud app versions delete "$DELETE_VERSION" --service="$SERVICE" --project=${{ vars.GCP_PROJECT_ID }} --quiet
          else
            echo "No old non-serving versions to delete for $SERVICE"
          fi

      - name: Remove rendered configs
        run: |
          rm app.yaml

      # Use PR author's username as git user name
      - name: Set up git user
        run: |
          git config --global user.name "${{ github.actor }}"
          git config --global user.email "${{ github.actor }}@users.noreply.github.com"

      # ":" are not alloed in git tags, so replace with "-"

Copilot AI Apr 15, 2026

Spelling: "alloed" -> "allowed" in this comment.

Suggested change
-      # ":" are not alloed in git tags, so replace with "-"
+      # ":" are not allowed in git tags, so replace with "-"

      - name: Tag commit
        run: |
          git tag -a "testing-deploy-$(date -u +%Y-%m-%d)T$(date -u +%H-%M-%S%z)" -m "testing gcloud deployment: $
          (date
          -u +%Y-%m-%d)T$(date -u +%H:%M:%S%z)"
Comment on lines +154 to +156

Copilot AI Apr 15, 2026

The tag message construction in this step appears syntactically broken: the -m string contains a literal $ followed by a newline, and the date command is split across lines. This will likely cause the shell to error and prevent tags from being created/pushed. Build the message on a single line (or use a heredoc/variable) so the git tag -m argument is valid.

Suggested change
-          git tag -a "testing-deploy-$(date -u +%Y-%m-%d)T$(date -u +%H-%M-%S%z)" -m "testing gcloud deployment: $
-          (date
-          -u +%Y-%m-%d)T$(date -u +%H:%M:%S%z)"
+          git tag -a "testing-deploy-$(date -u +%Y-%m-%d)T$(date -u +%H-%M-%S%z)" -m "testing gcloud deployment: $(date -u +%Y-%m-%d)T$(date -u +%H:%M:%S%z)"

          git push origin --tags
119 changes: 119 additions & 0 deletions ADR2.md
@@ -0,0 +1,119 @@
# ADR2: API Concurrency Fix Strategy

## Summary

This document describes a verified FastAPI concurrency issue in the API stack and recommends a two-phase remediation plan for maintainers.

The API uses synchronous SQLAlchemy sessions backed by `psycopg`. When those sessions are consumed from `async def` route handlers, blocking database work runs on the event loop thread if the handlers call synchronous ORM helpers directly. The lowest-risk immediate fix is to convert database-bound route handlers that do not perform asynchronous work into plain `def`. The longer-term fix is to introduce a real async SQLAlchemy stack and migrate the affected handlers and helpers incrementally.

Comment on lines +6 to +8

Copilot AI Apr 15, 2026

ADR2 states the API uses synchronous SQLAlchemy sessions backed by psycopg, but db/engine.py constructs postgresql+pg8000:// URLs for both Cloud SQL and local Postgres. To avoid misleading maintainers, update the ADR to reflect the actual driver(s) in use (or explain when/where psycopg is used).

## Problem

FastAPI supports synchronous generator dependencies such as `get_db_session()`. The issue is not the dependency shape itself. The issue is that the injected object is a synchronous SQLAlchemy `Session`, and any `async def` route that consumes it while executing synchronous ORM queries directly will block the event loop thread.

In this configuration, FastAPI runs the `async def` route body on the event loop thread. If that body performs blocking database I/O through the synchronous session, the worker cannot make progress on other requests assigned to that event loop until the database call returns. A slow well query can therefore delay unrelated lightweight requests handled by the same worker.

This is a concurrency problem, not a correctness problem. The endpoints can still return correct data while reducing throughput and responsiveness under load.
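
The serialization effect can be reproduced without a web framework. The following is a minimal simulation, not code from this repo: `run_blocking_query` is a hypothetical stand-in for a synchronous ORM call (it just sleeps), and `get_things` mimics an `async def` route that calls it directly.

```python
import asyncio
import time


def run_blocking_query(seconds: float = 0.1) -> str:
    """Hypothetical stand-in for a synchronous ORM query."""
    time.sleep(seconds)  # blocks whichever thread calls it
    return "rows"


async def get_things() -> str:
    # Mirrors an `async def` route that uses a sync session directly:
    # the blocking call runs on the event loop thread.
    return run_blocking_query()


async def serve_three_requests() -> float:
    started = time.perf_counter()
    # Three "concurrent" requests handled by the same event loop.
    await asyncio.gather(get_things(), get_things(), get_things())
    return time.perf_counter() - started


elapsed = asyncio.run(serve_three_requests())
print(f"three concurrent requests took {elapsed:.2f}s")
```

Although the three coroutines are scheduled together, each one holds the event loop for its full 0.1 s, so the total is roughly the sum of the three calls rather than the duration of one.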

## Evidence In This Repo

- [`db/engine.py`](db/engine.py) creates `database_sessionmaker = sessionmaker(engine, expire_on_commit=False)` and `get_db_session()` yields a regular synchronous `Session`.
- [`db/engine.py`](db/engine.py) builds synchronous `postgresql+psycopg` engines for both the default PostgreSQL path and the Cloud SQL path, confirming that the active database layer is synchronous.
- [`core/dependencies.py`](core/dependencies.py) injects that session through `session_dependency`.
- [`services/well_details_helper.py`](services/well_details_helper.py) performs synchronous ORM operations such as `session.scalars(...).all()` and related query chains.
- [`api/thing.py`](api/thing.py) contains representative database-backed routes that pass the synchronous session into helper functions such as `get_db_things(...)` and `get_well_details_payload(...)`.
- [`api/asset.py`](api/asset.py) shows a contrasting safe pattern for non-database blocking work by wrapping synchronous GCS calls in `run_in_threadpool(...)`.
- The short-term fix described in this ADR converts database-bound routes from `async def` to `def` where they do not need `await`, but the helper/query layer remains synchronous until a real async session stack is introduced.

## Short-Term Fix

The short-term fix is to convert database-bound route handlers from `async def` to `def` when they do not actually perform asynchronous work.

This lets FastAPI offload the entire route function to a worker thread instead of running its synchronous database calls on the event loop thread. It does not require changing the current database engine, dependency, query helpers, or response schemas.
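
The dispatch rule behind this can be sketched as a simplified model. This is not FastAPI's actual implementation: `call_handler` is a hypothetical helper, and `asyncio.to_thread` stands in for the framework's thread pool.

```python
import asyncio
import inspect
import threading


async def call_handler(handler):
    """Await coroutine handlers inline; run plain functions in a worker thread."""
    if inspect.iscoroutinefunction(handler):
        return await handler()
    return await asyncio.to_thread(handler)


async def async_route() -> int:
    # Runs on the event loop thread.
    return threading.get_ident()


def sync_route() -> int:
    # Runs wherever the dispatcher places it; here, a worker thread.
    return threading.get_ident()


async def main():
    loop_thread = threading.get_ident()
    return (
        loop_thread,
        await call_handler(async_route),
        await call_handler(sync_route),
    )


loop_thread, async_thread, sync_thread = asyncio.run(main())
print(async_thread == loop_thread, sync_thread == loop_thread)  # True False
```

Only the plain `def` handler leaves the event loop thread, which is why converting a database-bound route to `def` moves its blocking calls off the loop without touching the query code.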

### Short-term implementation guidance

- Convert any route handler that:
  - receives `session: session_dependency`,
  - performs synchronous ORM work directly or through helpers, and
  - does not require `await` for other operations in the route body.
- Prioritize the highest-value endpoints first:
  - high-traffic list and detail endpoints,
  - endpoints known to run expensive joins or eager-loads,
  - endpoints that affect warmup or perceived application responsiveness.
- Keep route behavior unchanged:
  - do not change paths, status codes, payloads, or auth dependencies as part of this phase.
- Avoid mixed patterns:
  - do not leave a route as `async def` if it still calls synchronous SQLAlchemy code directly.
- Use `run_in_threadpool(...)` only when a route must remain `async def` for a separate reason, such as mixing in another async operation, and only for isolated blocking helpers rather than as a blanket wrapper for all DB access.
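
For the last case, a route that must stay `async def`, the shape looks roughly like the sketch below. It uses the standard library's `asyncio.to_thread` as a stand-in for `run_in_threadpool(...)` so that it runs without a web framework; `load_rows` and `fetch_remote_flag` are hypothetical helpers, not functions from this repo.

```python
import asyncio
import threading
import time


def load_rows() -> dict:
    """Hypothetical blocking helper standing in for a synchronous ORM query."""
    time.sleep(0.05)
    return {"rows": 3, "thread": threading.get_ident()}


async def fetch_remote_flag() -> bool:
    """Hypothetical async step that is the reason the route stays `async def`."""
    await asyncio.sleep(0.01)
    return True


async def mixed_route() -> dict:
    flag = await fetch_remote_flag()
    # Only the isolated blocking helper is offloaded, not the whole route.
    payload = await asyncio.to_thread(load_rows)
    return {"flag": flag, **payload, "loop_thread": threading.get_ident()}


result = asyncio.run(mixed_route())
print(result["rows"], result["thread"] != result["loop_thread"])  # 3 True
```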

### Expected impact

- Lower risk than a full async migration.
- No intended HTTP contract changes.
- Better worker responsiveness because blocking DB work moves off the event loop thread.

## Long-Term Fix

The long-term fix is to add a real async database stack and migrate selected API areas to it incrementally.

This phase should introduce an explicit async path rather than trying to reuse the current synchronous dependency. Importing async SQLAlchemy primitives is not enough; the repo needs a working async engine, async sessionmaker, async dependency, and async query/helper layer for migrated endpoints.

### Long-term target architecture

- Add an `AsyncEngine` configured for the intended async driver.
- Add an `async_sessionmaker` that yields `AsyncSession` instances.
- Add a dedicated async dependency such as `get_async_db_session()` rather than overloading `get_db_session()`.
- Update migrated handlers and helper functions to use async database access:
  - `await session.execute(...)`
  - `await session.scalars(...)`
  - other `AsyncSession`-compatible patterns as needed

### Long-term migration guidance

- Migrate by subsystem, not all at once.
- Start with a bounded route/helper cluster where the query patterns are understood.
- Keep sync and async paths separate during migration to avoid ambiguous dependencies and accidental sync calls from async routes.
- Treat helper-layer migration as part of the work. Converting route signatures alone is insufficient if the helper functions still expect synchronous sessions.

### Non-goals and cautions

- Do not claim the repo already has a working async DB session path unless one is actually implemented and used.
- Do not treat “switch everything to async” as a trivial refactor.
- Do not mix `AsyncSession` route code with synchronous helper/query internals.

## Recommended Path

The recommended order is:

1. Convert database-bound `async def` routes that do not use `await` into plain `def`.
2. Validate behavior and measure the effect on responsiveness.
3. Introduce a dedicated async DB stack.
4. Migrate selected route/helper subsystems incrementally to `AsyncSession`.

This sequence delivers immediate concurrency improvement with limited risk, while preserving a clear path to a full async architecture later.

## Acceptance Criteria

### Short-term acceptance criteria

- Targeted API tests continue to pass after `async def` to `def` conversions.
- HTTP behavior is unchanged:
  - same routes,
  - same auth requirements,
  - same status codes,
  - same payload shapes.
- Concurrency smoke checks or request-timing instrumentation show that DB-heavy requests no longer block the event loop thread for that worker in the same way they do today.
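
One possible form for such a smoke check is an event-loop lag probe: a heartbeat task sleeps for a short interval and records how late it wakes up while a workload runs. This is a standard-library sketch, not tooling from this repo. `measure_max_loop_lag` and both workloads are hypothetical, and the 0.2 s sleep stands in for a DB-heavy request.

```python
import asyncio
import time


async def measure_max_loop_lag(workload, interval: float = 0.01) -> float:
    """Run `workload()` while a heartbeat records its worst wake-up delay."""
    max_lag = 0.0
    done = asyncio.Event()

    async def heartbeat():
        nonlocal max_lag
        while not done.is_set():
            before = time.perf_counter()
            await asyncio.sleep(interval)
            # Time slept beyond the requested interval is loop lag.
            max_lag = max(max_lag, time.perf_counter() - before - interval)

    beat = asyncio.create_task(heartbeat())
    await asyncio.sleep(0)  # let the heartbeat start its first sleep
    await workload()
    done.set()
    await beat
    return max_lag


async def blocking_workload():
    time.sleep(0.2)  # sync call on the event loop thread


async def offloaded_workload():
    await asyncio.to_thread(time.sleep, 0.2)  # same call in a worker thread


blocked_lag = asyncio.run(measure_max_loop_lag(blocking_workload))
offloaded_lag = asyncio.run(measure_max_loop_lag(offloaded_workload))
print(f"blocked: {blocked_lag:.3f}s, offloaded: {offloaded_lag:.3f}s")
```

The blocked case should show lag close to the full duration of the blocking call, while the offloaded case should stay near zero; exact values depend on machine load.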

### Long-term acceptance criteria

- Migrated endpoints pass the existing API test coverage for their subsystem.
- The async session lifecycle is correct for successful and failing requests.
- Migrated `async def` routes do not call synchronous session helpers.
- Before/after measurements are captured for latency and concurrency so the migration can be evaluated against real behavior rather than assumptions.

## Defaults And Assumptions

- This document is written for maintainers and assumes familiarity with FastAPI and SQLAlchemy internals.
- The document is self-contained and does not require code changes to be useful.
- The recommended short-term action is intentionally conservative and does not prescribe a file-by-file rollout sequence.
- The recommended long-term action is a staged migration, not a flag-day rewrite.
2 changes: 1 addition & 1 deletion api/author.py
@@ -30,7 +30,7 @@
     "/{author_id}/publications",
     response_model=list[PublicationResponse],
 )
-async def get_author_publications(
+def get_author_publications(
     user: viewer_dependency, author_id: int, session: session_dependency
 ):
     """