Warning: Unknown label
Please add the new label to .github/pytorch-probot.yml |
|
|
||
jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
Because all XPU runners under the pytorch org are currently self-hosted runners maintained directly by Intel, they need ECR permissions configured. I have submitted two PRs, pytorch/test-infra#7853 and pytorch/pytorch#177831, to enable this.
|
@chuanqi129, we're still getting the failures. Are there any other steps we need to take? |
Hi @scotts, I have double-checked the failure log and it looks very strange: according to the log, pytorch/test-infra#7853 and pytorch/test-infra#7860 don't seem to be working as expected. Could you please try rebasing your PR instead of rerunning the failed job? |
|
@chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302. |
Hi @scotts, I have submitted a new PR, pytorch/pytorch#178143, to address this cross-repo issue. Could you please help review it? |
|
@chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302 |
Hi @scotts , we don't enable |
|
Hi @scotts, I have checked the latest XPU workflow failure and created another PR, pytorch/pytorch#178380, to address it. Could you please help review it again? As for the Kineto / PyTorch build for XPU, I think we need some extra XPU-specific steps. I will get back to you on that later.
Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
|
@chuanqi129, progress! But now we're getting certificate issues on the host when trying to use |
Thanks @scotts. This is probably caused by the default Anaconda channel, which can't be used on Intel-owned machines; we can use the conda-forge channel instead. Let me check how to fix it.
|
@chuanqi129, I was curious if it would make a difference if I used conda-forge before you made any changes, and it still fails in the same way. Let me know if there's anything I can do to help! |
|
Hi @scotts, sorry for the late reply. I tried it locally, and it fails because the default channel also comes from Anaconda. I can resolve the issue with the workaround (WA) below. Could you please try it again? If it resolves the issue, we could consider adding this workaround directly to https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_conda.sh |
|
Thanks @scotts, we have a new failure now (https://github.com/pytorch/kineto/actions/runs/23869014674/job/69595758258?pr=1302#step:16:1692). This failure is caused by unbound variables in the XPU environment source scripts, so we need to use |
|
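For illustration, here is a minimal sketch of one common way to handle unbound variables in a sourced environment script, assuming the CI step runs the shell with `set -u`. The temporary script, the `DEMO_*` variables, and the helper name `source_without_nounset` are all hypothetical stand-ins; this is not the real XPU environment script and not necessarily the change that was applied in this PR.

```shell
#!/bin/sh
# Strict mode, as a CI step might run: exit on error, error on unset variables.
set -eu

# Hypothetical stand-in for a vendor environment script that reads a
# variable which may be unset (NOT the real XPU script).
env_script="$(mktemp)"
cat > "$env_script" <<'EOF'
export DEMO_LIB_PATH="/opt/demo/lib:${DEMO_EXTRA_PATH}"
EOF

# Hypothetical helper: source a script with nounset temporarily disabled,
# then restore it so the rest of the job still catches unset variables.
source_without_nounset() {
  set +u
  . "$1"
  set -u
}

# Make sure the variable really is unset for this demonstration.
unset DEMO_EXTRA_PATH

# Sourcing the script directly here would abort under `set -u`;
# the helper lets it expand the unset variable to an empty string.
source_without_nounset "$env_script"
echo "DEMO_LIB_PATH=${DEMO_LIB_PATH}"

rm -f "$env_script"
```

The relaxation is scoped to the single `.` call, so the build commands that follow still run under nounset.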
@chuanqi129, more progress! We're actually compiling Kineto now, but we're hitting some linker errors. Maybe the environment isn't set up correctly? https://github.com/pytorch/kineto/actions/runs/23880961999/job/69633881882?pr=1302#step:16:2346 |
|
@chuanqi129, the build succeeded and we're running tests! 🎉 It looks like we have two potential issues remaining: