Conversation
yiyixuxu
left a comment
thanks for the PR!
i left some feedbacks
yiyixuxu
left a comment
thanks!
i left a few more comments
yiyixuxu
left a comment
thanks! left two small comments
let's merge this soon
tests/models/transformers/test_models_transformer_ernie_image.py
@claude can you do a review here also? please keep these 3 notes in mind as well during your review
I'll analyze this and get back to you.
@bot /style
Style fix is beginning... View the workflow run here.
can you add the new doc pages to https://github.com/huggingface/diffusers/actions/runs/24222913924/job/70733036127?pr=13432#step:16:80 and also run
Current code:

    from ...utils import BaseOutput
    from ..normalization import RMSNorm
    from ..attention_processor import Attention
    from ..attention_dispatch import dispatch_attention_fn
    from ..attention import AttentionMixin, AttentionModuleMixin

Suggested change:

    from ...utils import BaseOutput, logging
    from ..normalization import RMSNorm
    from ..attention_processor import Attention
    from ..attention_dispatch import dispatch_attention_fn
    from ..attention import AttentionMixin, AttentionModuleMixin

    logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
`logger` is used on line 216 below but is not currently defined.
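For context, diffusers' `logging.get_logger` follows the standard module-scoped logger pattern. A minimal stdlib sketch of the same idea (the function here is a stand-in, not the diffusers implementation):

```python
import logging

def get_logger(name: str) -> logging.Logger:
    # Stand-in for diffusers' logging.get_logger: returns a logger scoped
    # to the module name so log output can be filtered per module.
    return logging.getLogger(name)

# Mirrors the module-level line the suggestion adds.
logger = get_logger("diffusers.models.transformers.transformer_ernie_image")
logger.warning("logger is defined at module scope before first use")
```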
    torch.backends.cuda.matmul.allow_tf32 = False

    class ErnieImageTransformerTests(ModelTesterMixin, unittest.TestCase):
actually, can you write the tests in the new format, using BaseModelTesterConfig? See this PR as a reference: https://github.com/huggingface/diffusers/pull/13344/changes
    width=1024,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator,
Suggested change:

    generator=torch.Generator("cuda").manual_seed(42),
`generator` is used here but not defined in the example.
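For readers of the doc example, the fix is simply to construct the generator before the pipeline call. A small self-contained sketch of why the seeded generator matters ("cpu" is used here so the snippet runs anywhere; on GPU you would pass "cuda" as in the suggestion):

```python
import torch

# Create the generator before using it; seeding makes sampling reproducible.
generator = torch.Generator("cpu").manual_seed(42)

a = torch.randn(2, 2, generator=generator)
b = torch.randn(2, 2, generator=torch.Generator("cpu").manual_seed(42))
assert torch.equal(a, b)  # same seed, same device -> identical samples
```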
    width=1024,
    num_inference_steps=8,
    guidance_scale=5.0,
    generator=generator,
Suggested change:

    generator=torch.Generator("cuda").manual_seed(42),
Same suggestion as in #13432 (comment).
    pipe = ErnieImagePipeline.from_pretrained("baidu/ERNIE-Image", torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    # 如果显存不足,可以开启offload

Suggested change:

    # If you are running low on GPU VRAM, you can enable offloading
nit: use the English translation of the comment since this file is in the English docs
    pipe = ErnieImagePipeline.from_pretrained("baidu/ERNIE-Image-Turbo", torch_dtype=torch.bfloat16)
    pipe.to("cuda")
    # 如果显存不足,可以开启offload

Suggested change:

    # If you are running low on GPU VRAM, you can enable offloading
Same as #13432 (comment).
    self.adaLN_mlp_ln = RMSNorm(hidden_size, eps=eps)
    self.mlp = ErnieImageFeedForward(hidden_size, ffn_hidden_size)

    def forward(self, x, rotary_pos_emb, shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp, attention_mask=None):
Suggested change:

    def forward(self, x, rotary_pos_emb, temb: tuple[torch.Tensor, ...], attention_mask: torch.Tensor | None = None):
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = temb
nit: I think it would be a little cleaner if we passed all the modulation parameters as a tuple in a `temb` argument and then unpacked it inside `forward`.
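A toy sketch of the suggested calling convention. `ToyBlock` here is a minimal stand-in, not the ERNIE-Image block; it only illustrates packing the six modulation tensors into one `temb` tuple and unpacking them inside `forward`:

```python
import torch
from torch import nn

class ToyBlock(nn.Module):
    # Minimal stand-in for a transformer block with adaLN-style modulation.
    def __init__(self, hidden_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, x: torch.Tensor, temb: tuple) -> torch.Tensor:
        # Modulation parameters arrive packed in a single tuple argument.
        shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = temb
        h = self.norm(x) * (1 + scale_msa) + shift_msa  # adaLN-style modulation
        return x + gate_msa * h

block = ToyBlock(4)
x = torch.randn(1, 3, 4)
temb = tuple(torch.zeros(1, 1, 4) for _ in range(6))  # six modulation tensors
out = block(x, temb)
```

This keeps the block's signature stable if the number of modulation parameters ever changes, since only the producer and the unpacking line need updating.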
    pe=pe,
    pe_tokenizer=pe_tokenizer,
    )
    self.vae_scale_factor = 16  # VAE downsample factor
Suggested change:

    self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels)) if getattr(self, "vae", None) else 16  # VAE downsample factor
nit: I think it would be better to try to get the VAE scale factor from the VAE config if possible so that it's easier to use different VAEs if necessary (not sure if the suggestion is exactly right).
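A self-contained sketch of the derivation, with the same caveat as the suggestion: whether the exponent needs a `-1` depends on the particular VAE architecture, so treat the formula as an assumption to verify against the actual model. The config object below is a stand-in for `self.vae.config`:

```python
from types import SimpleNamespace

def infer_vae_scale_factor(vae=None, default: int = 16) -> int:
    # Derive the spatial downsample factor from the number of VAE stages;
    # each downsampling stage halves the resolution. Whether a -1 is needed
    # in the exponent depends on the VAE, so this formula is an assumption.
    if vae is None:
        return default
    return 2 ** len(vae.config.block_out_channels)

# Stand-in config; a real pipeline would read self.vae.config instead.
vae = SimpleNamespace(config=SimpleNamespace(block_out_channels=[128, 256, 512, 512]))
factor = infer_vae_scale_factor(vae)  # 2**4 == 16 for this 4-stage config
```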
    # Latent dimensions
    latent_h = height // self.vae_scale_factor
    latent_w = width // self.vae_scale_factor
    latent_channels = 128  # After patchify
Suggested change:

    latent_channels = self.transformer.config.in_channels  # 128 after patchify
nit: get `latent_channels` from the transformer config so that the pipeline code is more robust to different transformers
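Putting both suggestions together, the latent shape can be derived entirely from configs instead of hard-coded constants. A sketch with stand-in config objects (a real pipeline would read `self.transformer.config` and `self.vae_scale_factor`):

```python
from types import SimpleNamespace

# Stand-in for self.transformer with a config exposing in_channels.
transformer = SimpleNamespace(config=SimpleNamespace(in_channels=128))
vae_scale_factor = 16

height, width = 1024, 1024
latent_h = height // vae_scale_factor             # 64
latent_w = width // vae_scale_factor              # 64
latent_channels = transformer.config.in_channels  # 128 after patchify
```

If a checkpoint ships a transformer with a different `in_channels`, the pipeline then adapts automatically instead of silently producing mismatched latents.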
dg845
left a comment
Thanks for the PR! Left a few small comments.
What does this PR do?
We have introduced a new text-to-image model called ERNIE-Image, which will soon be open-sourced to the community. This PR includes the model architecture definition, the pipeline, as well as the related documentation and test files.
Before submitting
See the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.