Skip to content

Support configurable ROCm path (ROCM_PATH / --rocm-path)#75

Open
coketaste wants to merge 1 commit intomainfrom
coketaste/rocm-path
Open

Support configurable ROCm path (ROCM_PATH / --rocm-path)#75
coketaste wants to merge 1 commit intomainfrom
coketaste/rocm-path

Conversation

@coketaste
Copy link
Collaborator

MAD previously assumed ROCm was under /opt/rocm. Newer Rock tar/whl packages often do not create that path, which caused 'Unable to determine gpu vendor' when amd-smi was installed elsewhere.

  • Add get_rocm_path(override); resolution: override -> ROCM_PATH env -> /opt/rocm.
  • Context accepts optional rocm_path for vendor detection, rocminfo, .info/version, docker env.
  • ROCmValidator and detect_gpu_vendor use configurable rocm_path.
  • RunModels passes args.rocm_path to Context and uses ctx for amd-smi paths.
  • Add madengine run --rocm-path. Tests: is_amd_gpu() uses get_rocm_path().

Backward compatible when ROCM_PATH and --rocm-path are unset.

MAD previously assumed ROCm was under /opt/rocm. Newer Rock tar/whl packages often do not create that path, which caused 'Unable to determine gpu vendor' when amd-smi was installed elsewhere.

   - Add get_rocm_path(override); resolution: override -> ROCM_PATH env -> /opt/rocm.
   - Context accepts optional rocm_path for vendor detection, rocminfo, .info/version, docker env.
   - ROCmValidator and detect_gpu_vendor use configurable rocm_path.
   - RunModels passes args.rocm_path to Context and uses ctx for amd-smi paths.
   - Add madengine run --rocm-path. Tests: is_amd_gpu() uses get_rocm_path().

   Backward compatible when ROCM_PATH and --rocm-path are unset.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds support for non-default ROCm installations by introducing a configurable ROCm root path (CLI --rocm-path and ROCM_PATH env) and wiring it through vendor detection, ROCm validation, and model execution paths.

Changes:

  • Add get_rocm_path() helper to resolve ROCm root via override -> ROCM_PATH -> /opt/rocm.
  • Plumb resolved rocm_path through Context, GPU vendor detection/validation, and run_models (including Docker env var propagation).
  • Update integration test AMD detection to respect configurable ROCm paths.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/test_gpu_renderD_nodes.py Uses get_rocm_path() to locate amd-smi for AMD detection in integration tests.
src/madengine/utils/gpu_validator.py Makes ROCm validation and vendor detection accept an optional rocm_path.
src/madengine/tools/run_models.py Passes --rocm-path into Context and uses context ROCm path for amd-smi invocation.
src/madengine/mad.py Adds madengine run --rocm-path CLI option.
src/madengine/core/context.py Stores/resolves ROCm path in context, exports ROCM_PATH into Docker env, and uses it for vendor detection/rocminfo/version file.
src/madengine/core/constants.py Introduces get_rocm_path() helper for ROCm root resolution.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 109 to 116
# Try hipconfig first
success, stdout, _ = self._run_command(['hipconfig', '--version'])
if success and stdout:
return stdout.split('-')[0] # Remove build suffix

# Try version file
version_file = '/opt/rocm/.info/version'
version_file = os.path.join(self.rocm_path, '.info', 'version')
if os.path.exists(version_file):
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROCm path is now configurable, but ROCmValidator still runs hipconfig from PATH. If ROCm is installed under a non-default location and the user only sets ROCM_PATH/--rocm-path, version detection will fail even though {rocm_path}/bin/hipconfig exists. Use the resolved self.rocm_path (e.g., run the absolute hipconfig path or prepend {rocm_path}/bin to PATH for subprocess calls).

Copilot uses AI. Check for mistakes.
Comment on lines +201 to 205
# ROCM_PATH allows non-default ROCm installs (e.g. Rock tar/whl) when /opt/rocm is not present.
rocm_path_escaped = self._rocm_path.replace("'", "'\"'\"'")
return self.console.sh(
'bash -c \'if [[ -f /usr/bin/nvidia-smi ]] && $(/usr/bin/nvidia-smi > /dev/null 2>&1); then echo "NVIDIA"; elif [[ -f /opt/rocm/bin/amd-smi ]]; then echo "AMD"; elif [[ -f /usr/local/bin/amd-smi ]]; then echo "AMD"; else echo "Unable to detect GPU vendor"; fi || true\''
f'ROCM_PATH="{rocm_path_escaped}" bash -c \'if [[ -f /usr/bin/nvidia-smi ]] && $(/usr/bin/nvidia-smi > /dev/null 2>&1); then echo "NVIDIA"; elif [[ -f "${{ROCM_PATH}}/bin/amd-smi" ]]; then echo "AMD"; elif [[ -f /usr/local/bin/amd-smi ]]; then echo "AMD"; else echo "Unable to detect GPU vendor"; fi || true\''
)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_gpu_vendor() builds a shell command by interpolating self._rocm_path into the command string. Since --rocm-path / ROCM_PATH is user-controlled and Console.sh runs with shell=True, this is vulnerable to shell injection / quoting breakage (escaping only ' is not sufficient). Prefer passing ROCM_PATH via the env= parameter to Console.sh (or use shlex.quote and avoid nested shell where possible).

Copilot uses AI. Check for mistakes.
# Get ROCm version
rocm_version_str = self.console.sh("cat /opt/rocm/.info/version | cut -d'-' -f1")
version_file = os.path.join(self._rocm_path, ".info", "version")
rocm_version_str = self.console.sh(f"cat {version_file} | cut -d'-' -f1")
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

version_file comes from configurable self._rocm_path but is interpolated into cat {version_file} without quoting. A ROCm path containing spaces/metacharacters will break the command and can lead to shell injection because Console.sh uses shell=True. Quote version_file (or read the file directly in Python and avoid shell execution).

Suggested change
rocm_version_str = self.console.sh(f"cat {version_file} | cut -d'-' -f1")
try:
with open(version_file, "r", encoding="utf-8") as vf:
rocm_version_str = vf.read()
except OSError as e:
raise RuntimeError(f"Failed to read ROCm version file '{version_file}': {e}")

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +72
self.rocm_path = get_rocm_path(rocm_path)
self.ESSENTIAL_PATHS = {
'rocm_root': self.rocm_path,
'hip_path': os.path.join(self.rocm_path, 'bin', 'hipconfig'),
'rocminfo': os.path.join(self.rocm_path, 'bin', 'rocminfo'),
}
self.RECOMMENDED_PATHS = {
'amd_smi': os.path.join(self.rocm_path, 'bin', 'amd-smi'),
'rocm_smi': os.path.join(self.rocm_path, 'bin', 'rocm-smi'),
}
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ROCmValidator now accepts rocm_path and builds absolute paths in ESSENTIAL_PATHS/RECOMMENDED_PATHS, but later validation still executes tools via PATH (e.g., rocminfo, amd-smi, rocm-smi). This makes rocm_path ineffective for non-default installs unless the user also modifies PATH. Consider having _run_command inject {self.rocm_path}/bin into PATH (or call the absolute tool paths from these dicts) so validation is consistent with the configured ROCm root.

Copilot uses AI. Check for mistakes.
Comment on lines +302 to 304
rocminfo_path = os.path.join(self._rocm_path, "bin", "rocminfo")
arch = self.console.sh(f"{rocminfo_path} |grep -o -m 1 'gfx.*'")
if not arch or arch.strip() == "":
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rocminfo_path is derived from ROCM_PATH/--rocm-path but is injected into a shell pipeline without quoting. If the path contains spaces or shell metacharacters, the command can break or be exploited. Quote the path (e.g., via shlex.quote) or avoid the shell pipeline by executing rocminfo directly and parsing output in Python.

Copilot uses AI. Check for mistakes.
Comment on lines +208 to 210
amd_smi_path = os.path.join(self.context.ctx["rocm_path"], "bin", "amd-smi")
self.console.sh(f"{amd_smi_path} || true")
elif gpu_vendor.find("NVIDIA") != -1:
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amd_smi_path is constructed from configurable rocm_path and then concatenated into a shell command without quoting. If the path contains spaces/metacharacters, this can break execution or enable shell injection. Quote the path (or pass it as an argv list via subprocess rather than a shell string).

Copilot uses AI. Check for mistakes.
Comment on lines +796 to 798
amd_smi_path = os.path.join(self.context.ctx["rocm_path"], "bin", "amd-smi")
smi = model_docker.sh(f"{amd_smi_path} || true")
elif gpu_vendor.find("NVIDIA") != -1:
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here: amd_smi_path comes from configurable rocm_path and is injected into a shell command without quoting. This can break for paths with spaces and is a shell-injection risk. Quote the path or execute via subprocess without shell=True.

Copilot uses AI. Check for mistakes.
Comment on lines +25 to 29
rocm_path = get_rocm_path()
amd_smi_path = os.path.join(rocm_path, "bin", "amd-smi")
vendor = console.sh(
'bash -c \'if [[ -f /opt/rocm/bin/amd-smi ]]; then echo "AMD"; elif [[ -f /usr/local/bin/amd-smi ]]; then echo "AMD"; else echo "OTHER"; fi || true\''
f'bash -c \'if [[ -f "{amd_smi_path}" ]]; then echo "AMD"; elif [[ -f /usr/local/bin/amd-smi ]]; then echo "AMD"; else echo "OTHER"; fi || true\''
)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

amd_smi_path is interpolated into a bash -c string. If ROCM_PATH contains a double quote, the test command can break (and, since Console.sh uses shell=True, it can also become an injection vector). Consider using shlex.quote or passing the path via an env var to the shell snippet instead of embedding it directly.

Copilot uses AI. Check for mistakes.
"""
raw = override if override else os.environ.get("ROCM_PATH", "/opt/rocm")
path = os.path.abspath(os.path.expanduser(str(raw).strip()))
return path.rstrip(os.sep)
Copy link

Copilot AI Feb 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_rocm_path() strips trailing separators with rstrip(os.sep), which turns the root path '/' into an empty string. That can lead to incorrect joins like os.path.join(rocm_path, 'bin', ...) producing relative paths. Consider using os.path.normpath() (or special-case '/') to normalize while preserving the root directory.

Suggested change
return path.rstrip(os.sep)
return os.path.normpath(path)

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants