
backport 1.10 - fix(lightspeed): pre-create /rag-content/vector_db/notebooks in init …#449

Merged
openshift-merge-bot[bot] merged 2 commits into redhat-developer:release-1.10 from JslYoon:fix/RHDHBUGS-3371-lightspeed-notebooks-permissions
Jul 1, 2026

Conversation

@JslYoon

@JslYoon JslYoon commented Jun 22, 2026


…container

On EKS/AKS, the RAG init container populates /rag-content/ but never creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db and fails with PermissionError because it cannot create the directory on a volume it doesn't own. OCP avoids this via fsGroup/supplemental group defaults.

The fix pre-creates the directory and widens permissions before the sidecar starts, matching the fix the operator already applies via chmod -R 777 for the rest of vector_db.
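The approach described above can be sketched roughly as follows. This is an illustration only, not the merged diff, which this page does not show. `RAG_ROOT` is a hypothetical stand-in for the `/rag-content` mount (it defaults to a temporary directory so the sketch runs anywhere), and the chmod is scoped to `vector_db` purely for the demonstration.

```shell
#!/bin/sh
# Hypothetical stand-in for the /rag-content mount; not a name used by the chart.
RAG_ROOT="${RAG_ROOT:-$(mktemp -d)}"

# Simulate the state after the init container has copied vector_db into the volume.
mkdir -p "$RAG_ROOT/vector_db"

# Pre-create the directory the sidecar would otherwise have to create at runtime,
# then widen permissions on the subtree this process owns.
mkdir -p "$RAG_ROOT/vector_db/notebooks" &&
  chmod -R a+rwX "$RAG_ROOT/vector_db"

# Record the resulting mode string (first 10 characters of the ls -ld output).
notebooks_mode=$(ls -ld "$RAG_ROOT/vector_db/notebooks" | cut -c1-10)
echo "$notebooks_mode"
```

Because the directory already exists and is writable by any UID once the init step finishes, a sidecar running under a different UID no longer needs to create it.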

Fixes: RHDHBUGS-3371

Screenshot From 2026-06-29 00-20-46

ran and verified on kind

Description of the change

Which issue(s) does this PR fix or relate to

How to test changes / Special notes to the reviewer

Checklist

  • For each Chart updated, version bumped in the corresponding Chart.yaml according to Semantic Versioning.
  • For each Chart updated, variables are documented in the values.yaml and added to the corresponding README.md. The pre-commit utility can be used to generate the necessary content. Run pre-commit run --all-files to run the hooks and then push any resulting changes. The pre-commit Workflow will enforce this and warn you if needed.
  • JSON Schema template updated and re-generated the raw schema via the pre-commit hook.
  • Tests pass using the Chart Testing tool and the ct lint command.
  • If you updated the orchestrator-infra chart, make sure the versions of the Knative CRDs are aligned with the versions of the CRDs installed by the OpenShift Serverless operators declared in the values.yaml file. See Installing Knative Eventing and Knative Serving CRDs for more details.

@JslYoon JslYoon requested a review from a team as a code owner June 22, 2026 21:15
@openshift-ci openshift-ci Bot requested review from Fortune-Ndlovu and rm3l June 22, 2026 21:15
@rhdh-qodo-merge


PR Summary by Qodo

Fix Lightspeed RAG init to pre-create notebooks dir and relax /rag-content perms
🐞 Bug fix ⚙️ Configuration changes 🕐 Less than 10 minutes


Description

• Pre-create /rag-content/vector_db/notebooks during RAG bootstrap init to avoid runtime mkdir failures.
• Widen /rag-content permissions so the llama-stack sidecar can write FAISS notebook storage on EKS/AKS.
Diagram

graph TD
  A["RAG bootstrap init"] --> B["Copy vector_db"] --> C[("/rag-content PV")]
  A --> D["mkdir notebooks"] --> E["/vector_db/notebooks"]
  A --> F["chmod a+rwX"] --> C
  G["llama-stack sidecar"] --> H["write faiss_store.db"] --> E
  subgraph Legend
    direction LR
    _job["Init/Runtime step"] ~~~ _pv[("Persistent Volume")]
  end
High-Level Assessment

The following are alternative approaches to this PR:

1. Set pod-level fsGroup/supplementalGroups for the volume
  • ➕ Aligns with OpenShift-style permission handling without chmod
  • ➕ Avoids broad world-writable permissions
  • ➖ May not work consistently across all storage classes/CSI drivers
  • ➖ Requires chart/pod spec changes beyond the init script and might impact other containers
2. Chown/chmod only the notebooks path (minimal scope)
  • ➕ Limits permission broadening to the exact directory needed
  • ➕ Reduces security exposure compared to recursive /rag-content chmod
  • ➖ May miss other write paths under /rag-content on some deployments
  • ➖ Requires careful auditing of all runtime write locations

Recommendation: Current approach is pragmatic for EKS/AKS: pre-creating the directory removes the failing mkdir path, and recursive a+rwX matches the existing operator behavior for vector_db. If security posture is a concern, consider narrowing the chmod scope to vector_db/notebooks in a follow-up.

Files changed (1) +2 / -0

Bug fix (1) +2 / -0
values.yaml: Create notebooks dir and relax /rag-content permissions in RAG init script (+2/-0)

• Adds a 'mkdir -p /rag-content/vector_db/notebooks' step during Lightspeed RAG bootstrap.
• Applies 'chmod -R a+rwX /rag-content' so the runtime sidecar can write the FAISS notebook store on volumes it does not own (notably EKS/AKS).

charts/backstage/values.yaml

@rhdh-qodo-merge

rhdh-qodo-merge Bot commented Jun 22, 2026


Code Review by Qodo

🐞 Bugs (1) 📘 Rule violations (0) 📎 Requirement gaps (0) 📜 Skill insights (0)

Context used
✅ Tickets: RHDHBUGS-3371



Action required

1. Non-root chmod breaks init ✓ Resolved 🐞 Bug ☼ Reliability
Description
The Lightspeed RAG initContainer runs as non-root with all capabilities dropped, but the new `chmod -R a+rwX /rag-content` can fail when the `/rag-content` mount root isn’t owned by the container user; because the script is chained with `&&`, this aborts the initContainer and blocks the pod in init.
Code

charts/backstage/values.yaml[141]

+          chmod -R a+rwX /rag-content &&
Evidence
The initContainer explicitly runs as non-root and drops all capabilities; the new chmod is part of
an &&-chained shell script, so any chmod failure aborts the init. /rag-content is mounted from
an emptyDir volume, and the upstream chart defaults podSecurityContext to {} (no
fsGroup/ownership configuration), making chmod on the mount root a likely failure mode for non-root
containers.

charts/backstage/values.yaml[126-160]
charts/backstage/vendor/backstage/charts/backstage/templates/backstage-deployment.yaml[98-110]
charts/backstage/vendor/backstage/charts/backstage/templates/backstage-deployment.yaml[135-160]
charts/backstage/vendor/backstage/charts/backstage/values.yaml[272-279]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The init script runs `chmod -R a+rwX /rag-content` while the initContainer is `runAsNonRoot` with capabilities dropped. If the `/rag-content` mount root isn’t owned by the initContainer user, `chmod` returns non-zero and, due to `&&` chaining, the initContainer exits and the pod never starts.

### Issue Context
`/rag-content` is mounted from an `emptyDir` volume, and this chart doesn’t configure a pod-level security context (e.g., `fsGroup`) that would make ownership predictable for non-root chmod.

### Fix Focus Areas
- charts/backstage/values.yaml[126-142]
- charts/backstage/vendor/backstage/charts/backstage/templates/backstage-deployment.yaml[98-110]
- charts/backstage/vendor/backstage/charts/backstage/templates/backstage-deployment.yaml[135-160]
- charts/backstage/vendor/backstage/charts/backstage/values.yaml[272-279]

### Suggested fix
Change the chmod to target only paths created by the initContainer (and/or avoid touching the mount root), e.g.:
- `chmod -R a+rwX /rag-content/vector_db /rag-content/embeddings_model`
 or
- `chmod -R a+rwX /rag-content/*`
Additionally, ensure the notebooks directory itself is writable by the runtime UID (e.g., `mkdir -p -m 0777 /rag-content/vector_db/notebooks`).
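The abort behavior described in this finding comes from ordinary `&&` semantics and can be shown in isolation. In the sketch below, `step_ok` and `step_fail` are hypothetical stand-ins for a chmod that succeeds (for example on subdirectories the init user created) and one that fails (for example on a mount root it does not own). No real chmod or volume is involved.

```shell
#!/bin/sh
# Hypothetical stand-ins for the chmod step of the init script.
step_ok()   { return 0; }   # chmod on paths the init user owns
step_fail() { return 1; }   # chmod on a mount root owned by another UID

# Run a three-step chain; $1 names the function that plays the chmod step.
run_chain() {
  echo "copied" && "$1" && echo "later steps ran"
}

# Capture output and exit status for both variants.
fail_out=$(run_chain step_fail) && fail_status=0 || fail_status=$?
ok_out=$(run_chain step_ok) && ok_status=0 || ok_status=$?

echo "failing chmod: status=$fail_status"
echo "passing chmod: status=$ok_status"
```

With the failing step, nothing after it runs and the non-zero status becomes the status of the whole chain, which is what would keep an init container from completing.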




Remediation recommended

2. Overbroad permissions change 🐞 Bug ⚙ Maintainability
Description
chmod -R a+rwX /rag-content makes the entire RAG content tree writable, even though the configured
write target is specifically the notebooks sqlite DB under /rag-content/vector_db/notebooks; this
is broader than necessary and increases the blast radius of accidental writes.
Code

charts/backstage/values.yaml[141]

+          chmod -R a+rwX /rag-content &&
Evidence
The Lightspeed configuration declares a sqlite DB under /rag-content/vector_db/notebooks, but the
init script applies recursive write permissions to the entire /rag-content mount, not just the DB
directory subtree.

charts/backstage/files/lightspeed/config.yaml[153-166]
charts/backstage/values.yaml[135-142]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The initContainer recursively grants write permissions to all of `/rag-content`, including data that is expected to be static (e.g., the embeddings model), while the runtime write path in config is the notebooks sqlite DB under `/rag-content/vector_db/notebooks`.

### Issue Context
The Lightspeed config explicitly points the notebooks store to `/rag-content/vector_db/notebooks/faiss_store.db`, so write permissions only need to cover that subtree (and potentially other vector_db DB locations), not the whole mount.

### Fix Focus Areas
- charts/backstage/values.yaml[135-142]
- charts/backstage/files/lightspeed/config.yaml[153-166]

### Suggested fix
Restrict permission widening to the minimal required subtree, e.g.:
- `mkdir -p -m 0777 /rag-content/vector_db/notebooks`
- `chmod -R a+rwX /rag-content/vector_db`
Avoid granting write permissions across all of `/rag-content` unless there is a documented runtime need for it.
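A small sketch of the narrower scoping suggested above, run against a temporary directory rather than a real volume. The file name `model.bin` and the `demo_root` layout are assumptions made for illustration; only the `vector_db`, `notebooks`, and `embeddings_model` names come from the discussion.

```shell
#!/bin/sh
# Hypothetical stand-in for /rag-content.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/embeddings_model" "$demo_root/vector_db"

# A static, read-only file standing in for the embeddings model (name is made up).
: > "$demo_root/embeddings_model/model.bin"
chmod 0444 "$demo_root/embeddings_model/model.bin"

# Create the notebooks directory with an explicit mode, then widen only vector_db.
# Note: -m applies to a directory mkdir creates, not to one that already exists.
mkdir -p -m 0777 "$demo_root/vector_db/notebooks"
chmod -R a+rwX "$demo_root/vector_db"

# Capital X adds execute/search only to directories (and already-executable files).
model_mode=$(ls -l "$demo_root/embeddings_model/model.bin" | cut -c1-10)
notebooks_mode=$(ls -ld "$demo_root/vector_db/notebooks" | cut -c1-10)
echo "model:     $model_mode"
echo "notebooks: $notebooks_mode"
```

Scoping the chmod this way leaves the model file's mode untouched, whereas a recursive chmod on the whole tree would also make it writable.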




@rm3l rm3l left a comment

Member


@JslYoon You'll need to also bump the chart version, run the pre-commit hooks and push the resulting changes. See the checklist on the PR description. Thanks.

@Jdubrick Jdubrick left a comment


@JslYoon @rm3l, I think we need a combination of podSecurityContext and making sure the permissions aren't copied during init. I was testing this locally and chmod fails on its own. This issue really stems from the fact that the RAG container's UID is 65532 and not 1001, but this suggestion should match what OCP already does, which is to set a blanket UID for the Pod, since this is only failing on vanilla Kubernetes. What do you think?

Comment thread charts/backstage/values.yaml
Comment thread charts/backstage/values.yaml
@JslYoon JslYoon changed the title fix(lightspeed): pre-create /rag-content/vector_db/notebooks in init … backport 1.10 - fix(lightspeed): pre-create /rag-content/vector_db/notebooks in init … Jun 23, 2026
@JslYoon JslYoon force-pushed the fix/RHDHBUGS-3371-lightspeed-notebooks-permissions branch from 3fea267 to 0caaa25 Compare June 23, 2026 17:44
@JslYoon JslYoon requested a review from rm3l June 23, 2026 18:02

@rm3l rm3l left a comment

Member


> I think we need a combination of podSecurityContext and making sure the permissions aren't copied during init. I was testing this locally and chmod fails on its own. This issue really stems from the fact that the RAG container's UID is 65532 and not 1001, but this suggestion should match what OCP already does, which is to set a blanket UID for the Pod, since this is only failing on vanilla Kubernetes. What do you think?

@Jdubrick @JslYoon In the current non-OCP install docs (e.g., EKS), we've been instructing users to only set upstream.backstage.podSecurity.fsGroup to a random value. I feel like we only need to set this here (even runAsUser or runAsGroup seem unnecessary). The most important point IMO is that chmod -R a+rwX /rag-content won't work, as the volume root is still owned by root. So you would need to chmod the subdirectories instead (chmod -R a+rwX /rag-content/vector_db /rag-content/embeddings_model).
Having fsGroup set on non-OCP would be sufficient to get the supplementary GID propagated to the container process.
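The ownership point in this comment can be illustrated with a guard that widens permissions only on directories the current user owns. This is a sketch, not what the chart does: `owner_uid`, `owns_path`, and `demo_root` are hypothetical names, and the check deliberately ignores root and capabilities because the init container is described as non-root with capabilities dropped.

```shell
#!/bin/sh
# Print the numeric owner UID of a path (third field of `ls -ldn`).
owner_uid() {
  ls -ldn "$1" | awk '{print $3}'
}

# Succeed only if the current user owns the path.
owns_path() {
  [ "$(owner_uid "$1")" = "$(id -u)" ]
}

# Hypothetical stand-in for /rag-content with the two subdirectories
# named in the comment above.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/vector_db" "$demo_root/embeddings_model"

# Widen permissions per subdirectory, skipping anything owned by another UID.
widened=""
for dir in "$demo_root/vector_db" "$demo_root/embeddings_model"; do
  if owns_path "$dir"; then
    chmod -R a+rwX "$dir" && widened="$widened $(basename "$dir")"
  fi
done
echo "widened:$widened"
```

On a real volume whose root is owned by another UID, the same check would report the mount root as not owned, which is why the comment recommends targeting the subdirectories instead.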

Comment thread charts/backstage/values.yaml Outdated
Comment thread charts/backstage/values.yaml Outdated
Comment thread charts/backstage/values.yaml Outdated
@JslYoon JslYoon force-pushed the fix/RHDHBUGS-3371-lightspeed-notebooks-permissions branch 6 times, most recently from 0725d0e to a276301 Compare June 27, 2026 19:21
@JslYoon JslYoon requested review from Jdubrick and rm3l June 29, 2026 04:32
Comment thread charts/backstage/values.yaml Outdated
@JslYoon JslYoon force-pushed the fix/RHDHBUGS-3371-lightspeed-notebooks-permissions branch 2 times, most recently from a276301 to b7d637f Compare June 29, 2026 18:46
Comment thread charts/backstage/values.yaml
@openshift-ci openshift-ci Bot removed the lgtm label Jun 30, 2026
@rm3l

rm3l commented Jun 30, 2026

Member

/lgtm cancel

@JslYoon JslYoon requested a review from rm3l June 30, 2026 17:28
…creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db and fails with PermissionError because it cannot create the directory on a volume it doesn't own. OCP avoids this via fsGroup/supplemental group defaults.

The fix pre-creates the directory and widens permissions before the sidecar starts, matching the fix the operator already applies via chmod -R 777 for the rest of vector_db.

Signed-off-by: Lucas <lyoon@redhat.com>
@JslYoon JslYoon force-pushed the fix/RHDHBUGS-3371-lightspeed-notebooks-permissions branch from 043ed00 to 56c5d3e Compare June 30, 2026 19:12
Comment thread .pre-commit-config.yaml Outdated
Co-authored-by: Armel Soro <armel@rm3l.org>
@sonarqubecloud

sonarqubecloud Bot commented Jul 1, 2026


@openshift-ci openshift-ci Bot added the lgtm label Jul 1, 2026
@rm3l

rm3l commented Jul 1, 2026

Member

/cherry-pick main

@openshift-cherrypick-robot


@rm3l: once the present PR merges, I will cherry-pick it on top of main in a new PR and assign it to you.

Details

In response to this:

/cherry-pick main

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit 810d8b0 into redhat-developer:release-1.10 Jul 1, 2026
6 checks passed
@openshift-cherrypick-robot


@rm3l: #449 failed to apply on top of branch "main":

Applying: On EKS/AKS, the RAG init container populates /rag-content/ but never creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db and fails with PermissionError because it cannot create the directory on a volume it doesn't own. OCP avoids this via fsGroup/supplemental group defaults.
Using index info to reconstruct a base tree...
M	.github/actions/test-charts/action.yml
M	charts/backstage/Chart.yaml
M	charts/backstage/README.md
M	charts/backstage/values.schema.json
M	charts/backstage/values.yaml
Falling back to patching base and 3-way merge...
Auto-merging .github/actions/test-charts/action.yml
CONFLICT (content): Merge conflict in .github/actions/test-charts/action.yml
Auto-merging charts/backstage/Chart.yaml
CONFLICT (content): Merge conflict in charts/backstage/Chart.yaml
Auto-merging charts/backstage/README.md
CONFLICT (content): Merge conflict in charts/backstage/README.md
Auto-merging charts/backstage/values.schema.json
Auto-merging charts/backstage/values.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 On EKS/AKS, the RAG init container populates /rag-content/ but never creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db and fails with PermissionError because it cannot create the directory on a volume it doesn't own. OCP avoids this via fsGroup/supplemental group defaults.


rm3l added a commit to rm3l/rhdh-chart that referenced this pull request Jul 1, 2026
…tebooks in init … (redhat-developer#449)

* On EKS/AKS, the RAG init container populates /rag-content/ but never creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db and fails with PermissionError because it cannot create the directory on a volume it doesn't own. OCP avoids this via fsGroup/supplemental group defaults.

The fix pre-creates the directory and widens permissions before the sidecar starts, matching the fix the operator already applies via chmod -R 777 for the rest of vector_db.

Signed-off-by: Lucas <lyoon@redhat.com>

* Apply suggestions from code review

Co-authored-by: Armel Soro <armel@rm3l.org>

---------

Signed-off-by: Lucas <lyoon@redhat.com>
Co-authored-by: Armel Soro <armel@rm3l.org>
openshift-merge-bot Bot pushed a commit that referenced this pull request Jul 1, 2026
…tebooks in init … (#449) (#460)

* On EKS/AKS, the RAG init container populates /rag-content/ but never creates the notebooks subdirectory. At runtime, llama-stack tries to write /rag-content/vector_db/notebooks/faiss_store.db and fails with PermissionError because it cannot create the directory on a volume it doesn't own. OCP avoids this via fsGroup/supplemental group defaults.

The fix pre-creates the directory and widens permissions before the sidecar starts, matching the fix the operator already applies via chmod -R 777 for the rest of vector_db.



* Apply suggestions from code review



---------

Signed-off-by: Lucas <lyoon@redhat.com>
Co-authored-by: Lucas Yoon <94267691+JslYoon@users.noreply.github.com>