First layer residual is inconsistently normalized

For most configurations, the first encoder block's pre-norm is skipped because the embedding layer already applies a post-normalization.

However, having this "pseudo-prenorm" in the embedding layer also means that the residual taken in that first block is already normalized.
This is inconsistent with the original pre-norm formulation, where the norm is applied only to the sublayer input and *not* to the residual branch (https://arxiv.org/pdf/2002.04745, Fig. 1).
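
To make the difference concrete, here is a minimal pure-Python sketch. It is not ModernBERT's actual code: `layer_norm`, `sublayer`, `prenorm_block`, and `first_block_with_embedding_norm` are hypothetical names, the norm has no learned scale or bias, and `sublayer` is a toy stand-in for attention or the MLP.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Toy stand-in for attention or the MLP (assumption, for illustration only)."""
    return [0.5 * v for v in x]


def prenorm_block(x):
    """Pre-norm as in the paper: the residual branch carries the raw input x."""
    return [r + s for r, s in zip(x, sublayer(layer_norm(x)))]


def first_block_with_embedding_norm(raw_emb):
    """Behavior described in this issue: the embedding output is normalized
    once, the first block skips its own norm, and the residual branch
    therefore carries the normalized tensor."""
    h = layer_norm(raw_emb)
    return [r + s for r, s in zip(h, sublayer(h))]


raw = [1.0, 2.0, 4.0, 9.0]
print("pre-norm, raw residual:       ", prenorm_block(raw))
print("first block, normed residual: ", first_block_with_embedding_norm(raw))
```

In this sketch the sublayer sees the same normalized input in both variants, so the two outputs differ by exactly `raw - layer_norm(raw)`, which is the part of the embedding that the residual stream loses.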

Any particular reason why it is implemented differently in ModernBERT?

First layer residual is inconsistently normalized #243

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

First layer residual is inconsistently normalized #243

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions