A new, excellent paper (Gisserot-Boukhlef et al., 2025)1 just came out analyzing whether masked language models (MLMs) are still better than causal language models (CLMs) at generating text embeddings.2 On a variety of downstream tasks and across different desiderata, they observed surprisingly mixed results between the two modeling approaches. Notably, they compared the CLM pretraining recipe with the traditional MLM pretraining recipe (i.e., a fixed masking rate, albeit with a sweep across different rates: 20%, 30%, 40%, and 50%). The comments that follow are not at all a criticism of this paper. Indeed, it is correct to evaluate the traditional recipe, because this is how MLMs are actually trained in practice.
But can we conclude from this that there is little to be learned from pretraining on more than next-token prediction? I think not. There are multiple differences between the CLM recipe and the traditional MLM recipe that are confounded with each other.
There is (of course) the fact that CLMs impose left-to-right asymmetry, which means that for each predicted token, tokens to the left (the prefix) are all observed and tokens to the right (the suffix) are all masked. Meanwhile, for MLMs, both the prefix and the suffix can contain a mixture of observed and masked tokens.
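To make this asymmetry concrete, here is a minimal sketch (plain NumPy; the toy sequence length, predicted position, and 30% masking rate are arbitrary choices of mine for illustration) of which positions are visible when predicting a single token under each objective:

```python
import numpy as np

rng = np.random.default_rng(0)
L, t = 10, 6  # toy sequence length and the position being predicted

# CLM: when predicting position t, exactly the prefix [0, t) is visible.
clm_visible = np.arange(L) < t

# MLM: a random subset of positions is masked; the predicted token sees
# every unmasked position, on both sides of t.
mlm_mask = rng.random(L) < 0.3   # e.g. a 30% masking rate
mlm_mask[t] = True               # the predicted token is itself masked
mlm_visible = ~mlm_mask

print(clm_visible.astype(int))   # -> [1 1 1 1 1 1 0 0 0 0]: full prefix, no suffix
print(mlm_visible.astype(int))   # observed and masked positions scattered on both sides of t
```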
CLMs and MLMs have different distributions of masking. For CLMs, the masking rate is ~ Uniform[0, 1], so 50% in expectation. (When one evaluates the log-likelihood of a sequence of length L autoregressively, the first term has a masking fraction of 1, the second has (L-1)/L, ..., the last has 1/L.) Meanwhile, for traditional MLMs, the masking rate is a fixed number (e.g., 20%). This can be further broken down:
Traditional MLMs (beginning with BERT at 15%) have tended to use less masking than the ~50% effective rate of CLMs. But recent MLM papers (such as the one above) are closing the gap. Increasing the masking rate increases the difficulty of the prediction task, forcing the models to become “smarter.”
An MLM with a 50% masking rate means that each predicted token always sees 50% of the rest of the sequence. But with a CLM, this fraction varies uniformly in [0, 1], so the task is sometimes much easier and sometimes much harder. Uniform-rate MLMs (Liao et al., 2020)3 close this gap, and are very underrated in my opinion. For example, recent work by Kitouni et al. (2024)4 found that this avoids the “reversal curse” commonly observed with CLMs.
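To make the distributional difference concrete, here is a small simulation sketch (the sequence length and the 30% fixed rate are illustrative choices of mine, not values taken from any of the papers above) of the masking fraction seen by each predicted token under the three recipes:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_tokens = 512, 100_000  # toy sequence length and number of sampled predictions

# CLM: predicting position t (1-indexed) leaves a fraction (L - t + 1) / L of the
# sequence unobserved, so the masking fraction is uniform over {1/L, 2/L, ..., 1}.
clm_frac = rng.integers(1, L + 1, n_tokens) / L

# Traditional MLM: every prediction is made at one fixed masking rate.
fixed_mlm_frac = np.full(n_tokens, 0.30)

# Uniform-rate MLM (Liao et al., 2020): the rate itself is drawn per sequence.
uniform_mlm_frac = rng.random(n_tokens)

for name, frac in [("CLM", clm_frac), ("fixed-rate MLM", fixed_mlm_frac),
                   ("uniform-rate MLM", uniform_mlm_frac)]:
    print(f"{name:>18}: mean={frac.mean():.2f}, std={frac.std():.2f}")
# CLM and uniform-rate MLM both average ~0.50 with high variance;
# the fixed-rate MLM sits at its chosen rate with zero variance.
```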
The masked tokens in CLMs and MLMs differ in their degree of contiguity. With CLMs, all masked tokens form a contiguous sequence; with MLMs, the masked tokens are randomly scattered. For sequence data with strong local correlations (such as genome sequences, due to linkage disequilibrium), the random scattering of masked tokens in MLMs makes the prediction task too easy, in my opinion. Span-based masking (Joshi et al., 2020)5 has shown promise at alleviating this.
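Here is a rough sketch of span-based masking, loosely in the spirit of SpanBERT's contiguous-span sampling; the geometric parameter, span cap, and target rate below are illustrative defaults of mine, not the paper's exact settings:

```python
import numpy as np

def span_mask(L, target_rate=0.15, p=0.2, max_span=10, rng=None):
    """Mask contiguous spans until roughly `target_rate` of the tokens are masked.

    Span lengths are drawn from a clipped geometric distribution, loosely in the
    spirit of SpanBERT; the specific parameters here are illustrative.
    """
    rng = rng or np.random.default_rng()
    mask = np.zeros(L, dtype=bool)
    while mask.sum() < target_rate * L:
        span_len = min(rng.geometric(p), max_span)
        start = rng.integers(0, max(1, L - span_len + 1))
        mask[start:start + span_len] = True
    return mask

rng = np.random.default_rng(0)
iid_mask = rng.random(60) < 0.15          # i.i.d. token masking: scattered singletons
contiguous_mask = span_mask(60, rng=rng)  # span masking: a few longer runs
print("".join(".#"[int(m)] for m in iid_mask))
print("".join(".#"[int(m)] for m in contiguous_mask))
```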
Thus CLM pretraining is one particular recipe, but “MLM pretraining” is really just the name we give to any mask-based pretraining recipe that is not CLM pretraining. There are many ways to mask a sequence once you release yourself from the constraint that the prefix is fully observed and the suffix is fully masked! With MLMs, one can choose the average masking rate to be either greater or less than that of CLMs. One can choose the variance of the masking-rate distribution to be anywhere from zero (as is traditionally done) to potentially even greater than that of CLMs. And one can use span-based masking to tune the contiguity of the masked tokens.
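As one hypothetical way to parameterize the rate-distribution part of this design space (my own sketch, not a recipe from any of the papers cited here), a Beta distribution over the per-sequence masking rate already interpolates between a fixed rate, a CLM-like Uniform[0, 1] rate, and even higher-variance regimes; combined with a span-length knob like the one sketched above, it covers most of the axes discussed so far:

```python
import numpy as np

def sample_masking_rate(mean=0.5, concentration=1.0, rng=None):
    """Draw a per-sequence masking rate from a Beta distribution.

    concentration -> infinity recovers a fixed rate equal to `mean` (traditional MLM);
    mean=0.5 with concentration=2 gives Beta(1, 1) = Uniform[0, 1], matching a CLM's
    effective rate distribution; concentration below 2 (at mean=0.5) gives a rate
    distribution with even higher variance than a CLM's.
    """
    rng = rng or np.random.default_rng()
    alpha = mean * concentration
    beta = (1.0 - mean) * concentration
    return rng.beta(alpha, beta)

rng = np.random.default_rng(0)
fixed_like = sample_masking_rate(mean=0.3, concentration=1e6, rng=rng)    # ~ fixed 30% rate
clm_like = sample_masking_rate(mean=0.5, concentration=2.0, rng=rng)      # Uniform[0, 1]
high_variance = sample_masking_rate(mean=0.5, concentration=0.5, rng=rng) # U-shaped: rates pile up near 0 and 1
```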
The diversity of possible MLM recipes (which can vary substantially in the resulting model behavior and performance) has a few implications. First, we cannot generalize the results for one particular MLM recipe to other, very different recipes. Second, we cannot necessarily attribute differences in observed performance to the left-to-right-ness of CLMs versus the bidirectionality of MLMs. Third, the vast space of MLM recipes, combined with the current overwhelming focus on CLMs, means that MLM training is underexplored. Not only are the MLM recipe hyperparameters underexplored, but we also don’t know how to optimally tune data selection and optimization algorithms given the choice of a particular MLM recipe.
Gisserot-Boukhlef, Hippolyte, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, and Pierre Colombo. "Should We Still Pretrain Encoders with Masked Language Modeling?" arXiv preprint arXiv:2507.00994 (2025).
What do we mean by CLMs and MLMs? MLMs involve masking random tokens in each sequence; the model is then trained to correctly predict each masked token, conditioned on the rest of the (partially masked) sequence. CLMs involve iterating through the tokens of each sequence; the model is then trained to correctly predict each token, conditioned on the portion of the sequence that came before it. This is closely related, but not equivalent, to causal versus bidirectional attention, which is how these pretraining approaches are typically implemented in Transformer models. With causal attention (and only causal attention), the post-softmax attention matrix is constrained to be lower-triangular.
Liao, Yi, Xin Jiang, and Qun Liu. "Probabilistically Masked Language Model Capable of Autoregressive Generation in Arbitrary Word Order." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 263-274. 2020.
Kitouni, Ouail, Niklas S. Nolte, Adina Williams, Michael Rabbat, Diane Bouchacourt, and Mark Ibrahim. "The factorization curse: Which tokens you predict underlie the reversal curse and more." Advances in Neural Information Processing Systems 37 (2024): 112329-112355.
Joshi, Mandar, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. "SpanBERT: Improving Pre-training by Representing and Predicting Spans." Transactions of the Association for Computational Linguistics 8 (2020): 64-77.