madengine v2 with unified framework for local and distribution #57

Open
coketaste wants to merge 301 commits into main from coketaste/refactor-dis
Conversation

@coketaste
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

coketaste added 30 commits July 7, 2025 16:38
… structure that emphasizes its core strengths in MAD package integration and distributed model execution
…environment variables, which is particularly useful in CI/CD environments, containerized deployments, or when you want to avoid storing credentials in files
[WIP] Enhanced distributed execution with runners
Update the additional context and manifest printout
Update the interface of report to-html
* feat(madengine): ROCm path override and RPD e2e fix

ROCM_PATH / --rocm-path:
- Add get_rocm_path() and wire through Context, ROCmToolManager, gpu_tool_factory,
  gpu_validator, container_runner, run orchestrator; add --rocm-path to run CLI
- Unit tests for get_rocm_path, Context, ROCmToolManager, run --help; update
  fixtures and test_get_cached_managers for (vendor, rocm_path) cache
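
The override described above could be resolved roughly as in the sketch below. Only the name `get_rocm_path`, the `ROCM_PATH` variable, and the `--rocm-path` flag come from the commit message; the signature, the precedence (CLI value, then environment, then a default), and the `/opt/rocm` default are assumptions for illustration, not the PR's actual implementation.

```python
import os
from typing import Mapping, Optional

# Assumed default install root; the PR does not state which default it uses.
DEFAULT_ROCM_PATH = "/opt/rocm"


def get_rocm_path(cli_value: Optional[str] = None,
                  env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve the ROCm root with a hypothetical precedence:
    --rocm-path value, then ROCM_PATH, then the assumed default."""
    env = os.environ if env is None else env
    candidate = cli_value or env.get("ROCM_PATH") or DEFAULT_ROCM_PATH
    # Strip trailing slashes so "/opt/rocm/" and "/opt/rocm" yield the
    # same key if the result is used in a (vendor, rocm_path) cache.
    return candidate.rstrip("/") or "/"
```

Normalizing the path before it is used as part of a cache key keeps two spellings of the same directory from creating two separate tool managers.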

RPD e2e:
- pre_scripts/trace.sh: install nlohmann-json3-dev (Ubuntu) and json-devel
  (CentOS) so rocmProfileData rpd_tracer build finds nlohmann/json.hpp

* Updated docs

* Add madengine logo icon

* Resize the logo icon
…performance, metric, and status show in the correct column (#77)

Use header index to replace fixed column order
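
A minimal sketch of the header-index idea, assuming a CSV whose header row names the `performance`, `metric`, and `status` columns. The function name `read_columns` and the sample layout are hypothetical and not taken from the PR.

```python
import csv
import io


def read_columns(csv_text, wanted=("performance", "metric", "status")):
    """Pick columns by header name instead of by fixed position."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Map each header name to its position in this particular file.
    index = {name.strip(): i for i, name in enumerate(header)}
    rows = []
    for row in reader:
        rows.append({name: row[index[name]] for name in wanted})
    return rows
```

Because the lookup goes through the header, reordering or adding columns in the CSV does not shift values into the wrong report column.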
…super) (#78)

K8s orchestrator now uses the same reporting path as single-node Docker
when multiple_results CSV is present, so both produce the same artifacts:
perf.csv, perf_entry.json, perf_entry.csv, perf_super.json,
perf_entry_super.json/csv, and perf_super.csv.

- Add Docker-compatible reporting path in _collect_results:
  - _build_common_info_dict() for common_info (args, gpu_arch, etc.)
  - _ensure_perf_csv_exists() so update_perf_csv can read perf.csv
  - Call update_perf_csv, update_perf_super_json, update_perf_super_csv
    with scripts_base_dir so --config from models.json resolves correctly

- Multi-node multiple_results: resolve one CSV path per run
  - _resolve_multiple_results_csv(): single pod → use that CSV; multi-pod
    → merge all pod CSVs with sum/average rules
  - _merge_multi_node_multiple_results_csv(): align rows by index;
    performance aggregated by metric type (throughput→sum, latency→avg,
    memory→max); extra columns by _aggregation_for_extra_column (sum/avg/max/first)
  - _aggregation_for_extra_column() for consistent multi-node semantics

- Keep legacy row-by-row _write_to_perf_csv when reporting module
  unavailable; record failure when no CSV found

- job.yaml.j2: no functional change required; existing copy block and
  find fallback for multiple_results already support this refactor
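
The multi-node merge rules listed above (throughput→sum, latency→avg, memory→max, rows aligned by index) can be sketched as follows. The helper names, the keyword matching on the metric string, and the dict-per-row layout are assumptions for illustration; the PR's `_merge_multi_node_multiple_results_csv()` may differ in all of these.

```python
from statistics import mean


def aggregate_performance(metric, values):
    """Aggregate per-pod values by metric type (keyword rules are assumed)."""
    name = metric.lower()
    if "latency" in name:
        return mean(values)
    if "memory" in name:
        return max(values)
    # Everything else is treated as throughput and summed (assumption).
    return sum(values)


def merge_pod_rows(pod_rows):
    """Merge rows from several pods, aligning rows by index.

    pod_rows holds one list of row dicts per pod. zip() stops at the
    shortest pod, so rows missing from any pod are dropped in this sketch.
    """
    merged = []
    for rows_at_index in zip(*pod_rows):
        first = rows_at_index[0]
        values = [float(r["performance"]) for r in rows_at_index]
        out = dict(first)  # non-aggregated columns come from the first pod
        out["performance"] = aggregate_performance(first["metric"], values)
        merged.append(out)
    return merged
```

Extra columns would need their own per-column rule (sum, avg, max, or first), which this sketch reduces to "first" for brevity.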
Resolve merge conflicts by keeping refactor-dis (v2) and discarding
main (v1) changes:

- Remove src/madengine/mad.py and src/madengine/tools/run_models.py
  (deleted in v2, accept deletion over main's modifications)
- Resolve rocenv_tool.py conflict: keep current-branch version for
  unknown GPU device handling
- Resolve tests/fixtures/dummy/models.json: keep v2 fixture set
  (dummy_superset and full model list) over main's therock-only entry
2 participants