Skip to content

[WIP]Full fine tune support#3

Open
poryfly wants to merge 1 commit into
kvcache-ai:sft-v5from
poryfly:full-fine-tune-support
Open

[WIP]Full fine tune support#3
poryfly wants to merge 1 commit into
kvcache-ai:sft-v5from
poryfly:full-fine-tune-support

Conversation

@poryfly

@poryfly poryfly commented May 22, 2026

Copy link
Copy Markdown

What does this PR do?

support full finetune for kt

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates KTransformers (KT) support into Accelerate, introducing the KTransformersPlugin and logic to handle KT expert modules within FSDP2. It also implements optimized communication primitives for Neuron devices that use power-of-2 padding to stabilize tensor shapes and reduce recompilations. Feedback highlights several issues: setup.py incorrectly pins a non-existent Torch version and renames the package for a fork, while the code contains an invalid Torch version check (2.7.0). Furthermore, the FSDP2 module-ignoring logic fails if ignored_modules is a string, and the implementation uses the deprecated torch.ByteStorage instead of torch.frombuffer.

Comment thread setup.py
"psutil",
"pyyaml",
"torch>=2.0.0",
"torch==2.9.1",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Pinning torch to version 2.9.1 is problematic as this version does not exist yet (current stable is 2.5.x). This will prevent the package from being installed in standard environments. Consider using a valid version range.

Suggested change
"torch==2.9.1",
"torch>=2.0.0",

Comment thread setup.py
Comment on lines +58 to +67
name="accelerate-kt",
version="1.14.0.post1",
description="Accelerate",
long_description=open("README.md", encoding="utf-8").read(),
long_description_content_type="text/markdown",
keywords="deep learning",
license="Apache",
author="The Hugging Face team",
author_email="transformers@huggingface.co",
url="https://github.com/huggingface/accelerate",
url="https://github.com/kvcache-ai/accelerate",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Renaming the package to accelerate-kt and changing the URL to a fork repository appears to be intended for a private fork. If this pull request is intended for the main accelerate repository, these changes should be reverted to maintain consistency with the upstream project.

Comment on lines +1728 to +1734
if self.state.fsdp_plugin.ignored_modules is None:
self.state.fsdp_plugin.ignored_modules = []
for wrapper in kt_wrappers:
experts_attr = getattr(wrapper, "_experts_attr", "experts")
experts = getattr(wrapper, experts_attr, None)
if experts is not None and experts not in self.state.fsdp_plugin.ignored_modules:
self.state.fsdp_plugin.ignored_modules.append(experts)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The logic for updating ignored_modules assumes it is a list. However, FullyShardedDataParallelPlugin.ignored_modules can also be a string (regex pattern). If it is a string, line 1733 will raise a TypeError and line 1734 will raise an AttributeError. Please ensure ignored_modules is handled correctly when it is a string (e.g., by converting the regex to a list of modules first).

if (not param.requires_grad) and (not param.is_floating_point()) and (not param.is_complex()):
incompatible_params4bit.add(param)

if incompatible_params4bit and is_torch_version(">=", "2.7.0"):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Checking for torch version >= 2.7.0 seems incorrect as this version is not yet released. Please verify if this should be a lower version (e.g., 2.4.0 or 2.5.0) where FSDP2 features were stabilized.

Comment on lines +470 to +471
byte_storage = torch.ByteStorage._from_buffer(raw_bytes)
input_tensor = torch.ByteTensor(byte_storage).to(device)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

torch.ByteStorage is deprecated. Consider using torch.frombuffer which is more idiomatic and avoids using internal methods like _from_buffer.

Suggested change
byte_storage = torch.ByteStorage._from_buffer(raw_bytes)
input_tensor = torch.ByteTensor(byte_storage).to(device)
input_tensor = torch.frombuffer(raw_bytes, dtype=torch.uint8).to(device)

@poryfly poryfly changed the base branch from main to sft-v5 May 22, 2026 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant