For most configurations, the first encoder block's pre-norm is skipped because the embedding layer already applies a post-normalization.
Yet, having this "pseudo-prenorm" in the embedding layer also means that the residual taken in that first block will be normalized.
This is inconsistent with how pre-norm was originally formulated, where the norm is applied only to the sublayer input and not to the residual (https://arxiv.org/pdf/2002.04745, Fig. 1).
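
To make the difference concrete, here is a minimal pure-Python sketch of the two variants as I understand them. It is not the actual ModernBERT code: `layer_norm` has no learned scale or bias, and `toy_sublayer` is a hypothetical stand-in for the attention/MLP sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm over a single vector, without learned scale/bias (simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention/MLP: just scales the input by 0.5."""
    return [0.5 * v for v in x]

def standard_prenorm_block(x):
    # Pre-norm as in the linked paper (Fig. 1): x + F(LN(x)).
    # The residual branch carries the un-normalized x.
    return [r + f for r, f in zip(x, toy_sublayer(layer_norm(x)))]

def embedding_norm_first_block(x):
    # Variant described above (assumed behavior): the embedding output is
    # normalized once, the first block's own norm is skipped, so the
    # residual branch carries LN(x) instead of x.
    h = layer_norm(x)
    return [r + f for r, f in zip(h, toy_sublayer(h))]

emb = [1.0, 2.0, 3.0, 10.0]  # made-up embedding vector
out_standard = standard_prenorm_block(emb)
out_variant = embedding_norm_first_block(emb)

# The sublayer sees LN(emb) in both cases; only the residual differs,
# so the outputs differ by exactly emb - LN(emb).
residual_gap = [s - v for s, v in zip(out_standard, out_variant)]
expected_gap = [e - n for e, n in zip(emb, layer_norm(emb))]
print(residual_gap)
print(expected_gap)
```

In both variants the sublayer receives the same normalized input; the outputs differ only through the residual branch, which is the discrepancy I am asking about.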
Any particular reason why it is implemented differently in ModernBERT?