LMEnt: A Suite for Analyzing Knowledge in Language Models from Pretraining Data to Representations
Daniela Gottesman, Alon Gilae-Dotan, Ido Cohen, Yoav Gur-Arieh, Marius Mosbach, Ori Yoran, Mor Geva

Abstract
Language models (LMs) increasingly drive real-world applications that require world knowledge. However, the internal processes through which models turn data into representations of knowledge and beliefs about the world are poorly understood. Insights into these processes could pave the way for developing LMs with knowledge representations that are more consistent, robust, and complete. To facilitate studying these questions, we present LMEnt, a suite for analyzing knowledge acquisition in LMs during pretraining. LMEnt introduces: (1) a knowledge-rich pretraining corpus, fully annotated with entity mentions, based on Wikipedia, (2) an entity-based retrieval method over pretraining data that outperforms previous approaches by as much as 80.4%, and (3) 12 pretrained models with up to 1B parameters and 4K intermediate checkpoints, with comparable performance to popular open-sourced models on knowledge benchmarks. Together, these resources provide a controlled environment for analyzing connections between entity mentions in pretraining and downstream performance, and the effects of causal interventions in pretraining data. We show the utility of LMEnt by studying knowledge acquisition across checkpoints, finding that fact frequency is key, but does not fully explain learning trends. We release LMEnt to support studies of knowledge in LMs, including knowledge representations, plasticity, editing, attribution, and learning dynamics.