
Add XPU workflow #1302

Draft
scotts wants to merge 13 commits into pytorch:main from scotts:ci_xpu


@scotts
Contributor

scotts commented Mar 13, 2026

No description provided.

@pytorch-bot

pytorch-bot bot commented Mar 13, 2026

Warning: Unknown label ciflow/xpu.
Currently recognized labels are

  • ciflow/rocm

Please add the new label to .github/pytorch-probot.yml


jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
Contributor


Because all XPU runners under the pytorch org are currently self-hosted runners maintained directly by Intel, they need ECR permissions configured. I have submitted two PRs, pytorch/test-infra#7853 and pytorch/pytorch#177831, to enable this.

@scotts
Contributor Author

scotts commented Mar 20, 2026

@chuanqi129, we're still getting the failures. Are there any other steps we need to take?

@chuanqi129
Contributor

@chuanqi129, we're still getting the failures. Are there any other steps we need to take?

Hi @scotts, I have double-checked the failure log and it looks very strange: according to the log, PRs pytorch/test-infra#7853 and pytorch/test-infra#7860 don't seem to be working as expected. Could you please try rebasing your PR instead of rerunning the failed job?

@scotts
Contributor Author

scotts commented Mar 23, 2026

@chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302.

@chuanqi129
Contributor

@chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302.

Hi @scotts, I have submitted a new PR to address this cross-repo issue: pytorch/pytorch#178143. Please help review it.

@scotts
Contributor Author

scotts commented Mar 24, 2026

@chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302

@chuanqi129
Contributor

@chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302

Hi @scotts, we don't enable the pytorch/almalinux-builder:xpu docker image build in pytorch, so the failure is expected. Could you please try the modification suggested above again?

@chuanqi129
Contributor

chuanqi129 commented Mar 25, 2026

Hi @scotts, I have checked the latest XPU workflow failure and created another PR to address it: pytorch/pytorch#178380. Please help review it again.

As for the kineto / pytorch build for XPU, I think we need some extra steps. I will get back to you later.

scotts and others added 2 commits March 27, 2026 10:55
Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
@scotts
Contributor Author

scotts commented Mar 27, 2026

@chuanqi129, progress! But now we're getting certificate issues on the host when trying to use conda to update packages: https://github.com/pytorch/kineto/actions/runs/23652712326/job/68936825671?pr=1302#step:16:66

@chuanqi129
Contributor

chuanqi129 commented Mar 30, 2026

@chuanqi129, progress! But now we're getting certificate issues on the host when trying to use conda to update packages: https://github.com/pytorch/kineto/actions/runs/23652712326/job/68936825671?pr=1302#step:16:66

Thanks @scotts, this is probably because the Anaconda default channel can't be used on Intel-owned machines; we can use the conda-forge channel instead. Let me check how to fix it.

@scotts
Contributor Author

scotts commented Mar 31, 2026

@chuanqi129, I was curious if it would make a difference if I used conda-forge before you made any changes, and it still fails in the same way. Let me know if there's anything I can do to help!

@chuanqi129
Contributor

chuanqi129 commented Apr 1, 2026

Hi @scotts, sorry for the late reply. I tried it locally: it still fails because the default channel is from Anaconda too. I can resolve the issue with the workaround (WA) below:

# show current channels
conda config --show channels
# remove defaults channel
conda config --remove channels defaults
# add conda-forge
conda config --add channels conda-forge
conda install -y 'cmake>=3.27'

Could you please try it again? If it resolves the issue, maybe we can consider adding this WA to https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_conda.sh directly.
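
For reference, the workaround above could be wrapped in a small shell function so that a CI step can call it once. This is only a sketch: `use_conda_forge` is a hypothetical helper name (it is not part of install_conda.sh), and ignoring a failed `--remove` is an assumption to cover the case where `defaults` is not explicitly configured.

```shell
# Sketch only: `use_conda_forge` is a hypothetical helper, not an existing script function.
use_conda_forge() {
  # Print the current channels so the CI log shows the starting state.
  conda config --show channels
  # Drop the Anaconda "defaults" channel; tolerate an error if it is not explicitly listed.
  conda config --remove channels defaults || true
  # --add puts conda-forge at the top of the channel list.
  conda config --add channels conda-forge
}

# Usage (assumes conda is already on PATH):
#   use_conda_forge
#   conda install -y 'cmake>=3.27'
```

The function only touches channel configuration; the `cmake` install stays a separate step so a failure there is easy to tell apart in the log.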

@chuanqi129
Contributor

Thanks @scotts, we have a new failure now: https://github.com/pytorch/kineto/actions/runs/23869014674/job/69595758258?pr=1302#step:16:1692. This failure is caused by unbound variables in the XPU env source scripts, so for now we need to use set +u before sourcing the XPU env.
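
A minimal sketch of that pattern, assuming the calling script runs with `set -u`: the helper name `source_without_nounset` is hypothetical, and `/path/to/xpu/env.sh` in the usage comment is a placeholder, not the real location of the XPU env script.

```shell
# Sketch only: source a script that may read unset variables without
# letting `set -u` abort the caller, then restore the previous setting.
source_without_nounset() {
  # Remember whether nounset (-u) is currently active.
  case $- in
    *u*) _restore_nounset=1 ;;
    *)   _restore_nounset=0 ;;
  esac
  set +u
  . "$1"
  _status=$?
  # Re-enable nounset only if it was on before the call.
  [ "$_restore_nounset" -eq 1 ] && set -u
  return "$_status"
}

# Usage (placeholder path):
#   set -u
#   source_without_nounset /path/to/xpu/env.sh
```

Restoring `-u` afterwards keeps the rest of the build script strict, so only the env script itself is exempt from the unbound-variable check.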

@scotts
Contributor Author

scotts commented Apr 2, 2026

@chuanqi129, more progress! We're actually compiling Kineto now, but we're hitting some linker errors. Maybe the environment isn't set up correctly? https://github.com/pytorch/kineto/actions/runs/23880961999/job/69633881882?pr=1302#step:16:2346

@chuanqi129
Contributor

Thanks @scotts for the progress update. Regarding the linker error, I have invited our developer @moksiuc to help double-check. Since we can successfully build kineto XPU together with pytorch XPU, there is no reason we can't build kineto XPU standalone. Let's check it.

@chuanqi129
Contributor

chuanqi129 commented Apr 6, 2026

Hi @scotts, for the linker error, I have submitted PR #1349 to fix it with Copilot's help and verified the fix in a local env. Please help review it. CC @moksiuc

@scotts
Contributor Author

scotts commented Apr 9, 2026

@chuanqi129, the build succeeded and we're running tests! 🎉 It looks like we have two potential issues remaining:

  1. A warning is flooding the logs: warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled). I don't see that message anywhere in the PyTorch or Kineto repos, so I suspect the hardware itself is emitting it. I found a related issue, "FP64 Emulation Support is Broken; Cannot Run Own Scripts" (intel/intel-extension-for-pytorch#257). Is there a setting we need in the environment scripts?
  2. There's an infra error uploading results files: https://github.com/pytorch/kineto/actions/runs/24159003509/job/70505315113?pr=1302#step:19:55
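
Until the source of the warning in item 1 is understood, one stopgap for log readability could be a filter that keeps the first occurrence and counts the rest. This is a sketch under assumptions: `collapse_fp64_warning` is a hypothetical helper, it matches on the warning text quoted above, and it does nothing about the underlying FP64 setting.

```shell
# Sketch only: keep the first FP64 warning, drop repeats, and report how many were dropped.
collapse_fp64_warning() {
  awk '
    /Double arithmetic operation is not supported on this platform/ {
      if (count++ == 0) print   # print only the first occurrence
      next
    }
    { print }                   # pass every other line through unchanged
    END { if (count > 1) printf "[FP64 warning repeated %d more times]\n", count - 1 }
  '
}

# Usage (the test command is a placeholder):
#   ./run_tests 2>&1 | collapse_fp64_warning
```

Because the filter sits at the end of a pipe, the step's exit status would come from the filter unless the calling shell enables something like bash's `pipefail`.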

2 participants