madengine v2 with unified framework for local and distribution by coketaste · Pull Request #57 · ROCm/madengine

coketaste · 2025-12-05T15:43:13Z

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* feat(madengine): ROCm path override and RPD e2e fix

  ROCM_PATH / --rocm-path:
  - Add get_rocm_path() and wire it through Context, ROCmToolManager, gpu_tool_factory, gpu_validator, container_runner, and the run orchestrator; add --rocm-path to the run CLI
  - Unit tests for get_rocm_path, Context, ROCmToolManager, run --help; update fixtures and test_get_cached_managers for the (vendor, rocm_path) cache

  RPD e2e:
  - pre_scripts/trace.sh: install nlohmann-json3-dev (Ubuntu) and json-devel (CentOS) so the rocmProfileData rpd_tracer build finds nlohmann/json.hpp

* Updated docs
* Add madengine logo icon
* Resize the logo icon
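The ROCm path override described above can be pictured with a small sketch. This is not the PR's `get_rocm_path()`; its signature and precedence rules are not shown on this page. The function name `resolve_rocm_path`, the precedence (CLI value, then `ROCM_PATH`, then a default), and the `/opt/rocm` default are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

# Assumption: the conventional ROCm install prefix is used as the fallback.
DEFAULT_ROCM_PATH = "/opt/rocm"


def resolve_rocm_path(override: Optional[str] = None,
                      env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical stand-in for get_rocm_path().

    Assumed precedence: explicit override (e.g. from --rocm-path),
    then the ROCM_PATH environment variable, then the default prefix.
    """
    env = os.environ if env is None else env
    candidate = override or env.get("ROCM_PATH") or DEFAULT_ROCM_PATH
    # Normalise so "/opt/rocm/" and "/opt/rocm" give the same value,
    # which matters if the path is part of a (vendor, rocm_path) cache key.
    return os.path.normpath(candidate)
```

Passing the environment as a parameter keeps the function easy to unit test without patching `os.environ`, which may be one reason to structure such a helper this way.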

…super) (#78)

K8s orchestrator now uses the same reporting path as single-node Docker when multiple_results CSV is present, so both produce the same artifacts: perf.csv, perf_entry.json, perf_entry.csv, perf_super.json, perf_entry_super.json/csv, and perf_super.csv.

- Add Docker-compatible reporting path in _collect_results:
  - _build_common_info_dict() for common_info (args, gpu_arch, etc.)
  - _ensure_perf_csv_exists() so update_perf_csv can read perf.csv
  - Call update_perf_csv, update_perf_super_json, update_perf_super_csv with scripts_base_dir so --config from models.json resolves correctly
- Multi-node multiple_results: resolve one CSV path per run
  - _resolve_multiple_results_csv(): single pod → use that CSV; multi-pod → merge all pod CSVs with sum/average rules
  - _merge_multi_node_multiple_results_csv(): align rows by index; performance aggregated by metric type (throughput→sum, latency→avg, memory→max); extra columns by _aggregation_for_extra_column (sum/avg/max/first)
  - _aggregation_for_extra_column() for consistent multi-node semantics
- Keep legacy row-by-row _write_to_perf_csv when reporting module unavailable; record failure when no CSV found
- job.yaml.j2: no functional change required; existing copy block and find fallback for multiple_results already support this refactor
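The multi-node merge rules listed above (align rows by index; throughput→sum, latency→avg, memory→max) can be sketched as follows. This is an illustration only, not the code of `_merge_multi_node_multiple_results_csv()`. The function names, the keyword-based metric classification, and the choice to copy non-performance columns from the first pod are assumptions; the `metric` and `performance` column names are taken from the commit messages on this page.

```python
from typing import Dict, List, Sequence


def aggregation_for_metric(metric: str) -> str:
    """Assumed heuristic: classify a metric by keywords in its name."""
    name = metric.lower()
    if "latency" in name:
        return "avg"
    if "memory" in name:
        return "max"
    return "sum"  # treat everything else as throughput-like


def aggregate(values: Sequence[float], how: str) -> float:
    if how == "sum":
        return sum(values)
    if how == "avg":
        return sum(values) / len(values)
    if how == "max":
        return max(values)
    raise ValueError(f"unknown aggregation: {how}")


def merge_pod_rows(per_pod_rows: List[List[Dict[str, str]]]) -> List[Dict[str, object]]:
    """Merge one list of CSV rows per pod into a single list of rows."""
    merged = []
    # zip() aligns rows by index and stops at the shortest pod.
    for rows in zip(*per_pod_rows):
        out: Dict[str, object] = dict(rows[0])  # other columns: first pod wins
        how = aggregation_for_metric(rows[0]["metric"])
        out["performance"] = aggregate([float(r["performance"]) for r in rows], how)
        merged.append(out)
    return merged
```

With two pods reporting throughput of 100 and 120, the merged row would carry 220; two latency readings of 10 and 14 would average to 12.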

Resolve merge conflicts by keeping refactor-dis (v2) and discarding main (v1) changes:
- Remove src/madengine/mad.py and src/madengine/tools/run_models.py (deleted in v2, accept deletion over main's modifications)
- Resolve rocenv_tool.py conflict: keep current-branch version for unknown GPU device handling
- Resolve tests/fixtures/dummy/models.json: keep v2 fixture set (dummy_superset and full model list) over main's therock-only entry

coketaste added 30 commits July 7, 2025 16:38

Updated the distributed cli interface and clean up the code

3c1da45

Fix the pulling issue from registry

0fb0e53

Updated the docs

ab0bbe6

Created a professional, comprehensive, and maintainable documentation structure that emphasizes its core strengths in MAD package integration and distributed model execution

81bc4e4

make a well-formatted documentation of README

ab36c76

Fix the MODEL_DIR setup issue

85c66de

Fixed the out of date unit tests in distributed cli

91805ae

All syntax errors resolved - file compiles successfully in distributed_cli unit tests

0a1a679

Fix the test case of distributed integration

ef64de6

Fixed the test profiling

23b3bbb

Updated the fix to handle permission error

0fec233

Refine the assertion

b5f6486

Added test cases of mad_cli and distributed integration

7060f76

Massively enhanced distributed execution with runners of SSH, Ansible, and K8s; expanded command line interface

b65bf0d

Reverted some missing functions

661a9ae

New functionality allows users to provide Docker Hub credentials via environment variables, which is particularly useful in CI/CD environments, containerized deployments, or when you want to avoid storing credentials in files

29ac831
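As a rough illustration of credentials supplied through environment variables, the sketch below reads a username and password from variables sharing the `MAD_DOCKERHUB_` prefix that a later commit on this page mentions. The suffixes `USER` and `PASSWORD`, the function name, and the returned dictionary shape are hypothetical; the actual parsing logic in madengine is not shown here.

```python
import os
from typing import Dict, Mapping, Optional

# The prefix appears in a commit title on this page; the suffixes are assumed.
ENV_PREFIX = "MAD_DOCKERHUB_"


def dockerhub_credentials_from_env(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Dict[str, str]]:
    """Return credentials from the environment, or None if incomplete."""
    env = os.environ if env is None else env
    user = env.get(ENV_PREFIX + "USER")          # hypothetical variable name
    password = env.get(ENV_PREFIX + "PASSWORD")  # hypothetical variable name
    if not user or not password:
        # The caller can fall back to a credentials file or warn the user.
        return None
    return {"username": user, "password": password}
```

Returning `None` rather than raising lets a caller decide whether missing credentials are an error or only worth a warning.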

Merge branch 'coketaste/refactor' into coketaste/refactor-runners

8e26033

Changed docker.io to dockerhub

db75808

Merge branch 'coketaste/refactor' into coketaste/refactor-runners

14cc12e

Fix the test case of context

9b09f01

Updated README.md

2a26dbf

Fix the unit test of e2e distributed run with profiling

b35508b

Fixed the issue of mocks gpu

a61c287

Rewrite the unit test gpu version

96d7e27

Fixed the manifest name error

566f1cb

Fixed the missing manifest file

cbd86c1

Updated the warning message of missing cred

b3052f5

Merge pull request #14 from ROCm/coketaste/refactor-runners

4955bcf

[WIP] Enhanced distributed execution with runners

Updated the MAD_DOCKERHUB_ creds parsing logic

71fe348

Merge branch 'coketaste/refactor' of https://github.com/ROCm/madengine into coketaste/refactor

49f60dc

coketaste added 30 commits February 10, 2026 21:26

Fixed the multi results case on k8s infra

7df733d

Updated config for profiling

a88e68b

Merge pull request #69 from ROCm/coketaste/refactor-dis-multiresult

5392028

Fix multi-result on k8s

Added nodelist arg for slurm, and improved the configuration module

f6a4b0f

Merge pull request #71 from ROCm/coketaste/refactor-dis-slurm-nodes

44c56ee

Update node list config for SLURM

Updated docs about changes in slurm

f83e8da

Added a new rocprofv3 config mode for tool

e596d3b

Merge branch 'coketaste/refactor-dis' of https://github.com/ROCm/madengine into coketaste/refactor-dis

f189de1

Updated rocprofv3_agent tool config

fa091ea

Added rocprofv3_agent_full with counter collection or instruction mix

0839eb9

Updated trace config

48d96e1

Added rocprofv3 thread tracing

c6d5432

Updated the config of thread tracing and need to set ROCPROF_ATT_LIBRARY_PATH

97aed8a

Updated tools.json and remove thread tracing config for rocprof v3

36ae997

Fixed the error patterns and modified the silent error check via subprocess

db83e50

Merge pull request #72 from ROCm/coketaste/cleanup-printout

1e21e31

Silent error check via subprocess

Updated the additional context and manifest printout

5840912

Merge pull request #73 from ROCm/coketaste/refactor-dis-printout

04bf518

Update the additional context and manifest printout

Updated interface of madengine report to-html using --csv-file-path

2db7ebc

Merge pull request #74 from ROCm/coketaste/refactor-dis-html

2245503

Update the interface of report to-html

Updated the icon of madengine

1748af5

Updated docs

8dccac4

Updated README

def0f39

Fixed the issue. Appended rows now match the existing header; model, performance, metric, and status show in the correct column (#77)

Use header index to replace fixed column order

2177b68
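The idea of using the header to place values, rather than relying on a fixed column order, can be sketched with the standard `csv` module. This is an assumption-laden illustration, not the fix from #77: the function name is hypothetical and it works on in-memory strings rather than on perf.csv.

```python
import csv
import io
from typing import Dict


def append_row_by_header(existing_csv: str, row: Dict[str, str]) -> str:
    """Append `row` so each value lands under the matching header column."""
    header = next(csv.reader(io.StringIO(existing_csv)))
    out = io.StringIO()
    out.write(existing_csv)
    if not existing_csv.endswith("\n"):
        out.write("\n")
    # DictWriter orders values by `fieldnames`, so the existing header
    # decides the column order; missing keys are left empty and unknown
    # keys are dropped instead of shifting later columns.
    writer = csv.DictWriter(out, fieldnames=header, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writerow(row)
    return out.getvalue()
```

For an existing file whose header is `status,model,metric,performance`, a row given as `{"model": "b", "performance": "2.0", ...}` would still be written in header order.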

Updated README.md

f3489a4

Merge branch 'coketaste/refactor-dis' of github.com:ROCm/madengine into coketaste/refactor-dis

8521632

Fixed launcher type issue on k8s

a197d7c

coketaste wants to merge 301 commits into main from coketaste/refactor-dis

2 participants
