12. Extra materials
Chapter 31: LLMs for Synthetic Data
An increasing number of applications are making use of LLM-generated data for training or evaluations, including distillation, dataset augmentation, AI-assisted evaluation and labeling, self-critique, and more. This post demonstrates how to construct such a synthetic dataset (in a RAG context), and this post from Argilla gives an overview of RLAIF, a popular alternative to RLHF given the challenges associated with gathering pairwise human preference data. AI-assisted feedback is also a central component of the “Constitutional AI” alignment method pioneered by Anthropic (see their blog for an overview).
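To make the AI-assisted labeling idea concrete, here is a minimal sketch of RLAIF-style preference labeling, where an LLM judges which of two responses is better and the verdict becomes one row of a synthetic preference dataset. The judge prompt, model choice, and `judge_pair` helper are illustrative assumptions, not taken from the linked posts.

```python
# Minimal sketch of AI-assisted preference labeling (RLAIF-style), assuming an
# OpenAI API key is configured in the environment. Prompt wording, model choice,
# and the helper function are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def judge_pair(prompt: str, response_a: str, response_b: str) -> str:
    """Ask an LLM which of two responses better answers the prompt ('A' or 'B')."""
    judge_prompt = (
        "You are a careful evaluator. Given a user prompt and two candidate "
        "responses, answer with a single letter, A or B, for the better response.\n\n"
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content.strip()

# Each labeled pair becomes one example in a synthetic preference dataset.
label = judge_pair(
    "Explain what a control vector is in one sentence.",
    "A control vector is a direction in activation space added at inference time to steer model behavior.",
    "It's a type of steering wheel for cars.",
)
print(label)  # expected: "A"
```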
Chapter 32: Representation Engineering
Representation Engineering is a new and promising technique for fine-grained steering of language model outputs via “control vectors”. Somewhat similar to LoRA adapters, it has the effect of adding low-rank biases to the weights of a network which can elicit particular response styles (e.g. “humorous”, “verbose”, “creative”, “honest”), yet is much more computationally efficient and can be implemented without any training. Instead, the method simply looks at differences in activations for pairs of inputs which vary along the axis of interest (e.g. honesty), which can be generated synthetically, and then performs dimensionality reduction.
See this short blog post from the Center for AI Safety (who pioneered the method) for a brief overview, and this post from Theia Vogel for a technical deep-dive with code examples. Theia also walks through the method in this podcast episode.
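For intuition, the sketch below implements the activation-difference recipe described above on a small open model: collect hidden states for contrastive prompt pairs, take their differences, and reduce them to a single direction. The model, layer choice, prompts, and single-component PCA are illustrative assumptions rather than the exact procedure from the linked posts.

```python
# Minimal sketch of deriving a "control vector" from activation differences on
# contrastive prompt pairs. Model, layer, and prompts are illustrative assumptions.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small model purely for illustration
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Prompt pairs that differ only along the axis of interest (here, honesty).
pairs = [
    ("Pretend you are an honest assistant. The capital of France is",
     "Pretend you are a deceptive assistant. The capital of France is"),
    ("Pretend you are an honest assistant. Two plus two equals",
     "Pretend you are a deceptive assistant. Two plus two equals"),
]

layer = 6  # which hidden layer to read activations from (assumption)
diffs = []
with torch.no_grad():
    for pos_text, neg_text in pairs:
        acts = []
        for text in (pos_text, neg_text):
            ids = tok(text, return_tensors="pt")
            out = model(**ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer][0, -1, :])  # last-token activation
        diffs.append((acts[0] - acts[1]).numpy())

# Dimensionality reduction over the differences; the top component is the control vector,
# which can then be added to (or subtracted from) activations at generation time.
control_vector = PCA(n_components=1).fit(diffs).components_[0]
print(control_vector.shape)  # (hidden_size,)
```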
Chapter 33: Mechanistic Interpretability
Mechanistic Interpretability (MI) is the dominant paradigm for understanding the inner workings of LLMs by identifying sparse representations of “features” or “circuits” encoded in model weights. Beyond enabling potential modification or explanation of LLM outputs, MI is often viewed as an important step towards potentially “aligning” increasingly powerful systems. Most of the references here will come from Neel Nanda, a leading researcher in the field who’s created a large number of useful educational resources about MI across a range of formats:
“A Comprehensive Mechanistic Interpretability Explainer & Glossary”
“An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers”
“Mechanistic Interpretability Quickstart Guide” (Neel Nanda on LessWrong)
“How useful is mechanistic interpretability?” (Neel and others, discussion on LessWrong)
“200 Concrete Problems In Interpretability” (Annotated spreadsheet of open problems from Neel)
Additionally, the articles “Toy Models of Superposition” and “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” from Anthropic are on the longer side, but feature a number of great visualizations and demonstrations of these concepts.
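As a toy companion to the superposition and monosemanticity readings, here is a minimal sparse-autoencoder sketch of the kind used to decompose activations into (ideally) interpretable features. The layer sizes, L1 coefficient, and random stand-in “activations” are placeholder assumptions, not anything from the Anthropic papers.

```python
# Toy sparse autoencoder for decomposing activations into sparse features.
# Sizes, the L1 coefficient, and the random stand-in activations are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return reconstruction, features

d_model, d_features = 512, 4096  # overcomplete feature dictionary
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# Stand-in for a batch of residual-stream activations from a real model.
activations = torch.randn(64, d_model)

for _ in range(100):
    recon, feats = sae(activations)
    # Reconstruction loss plus an L1 penalty that encourages sparse feature usage.
    loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(loss.item(), (feats > 0).float().mean().item())  # reconstruction error, feature sparsity
```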
Chapter 34: Linear Representation Hypotheses
An emerging theme from several lines of interpretability research has been the observation that internal representations of features in Transformers are often “linear” in high-dimensional space (a la Word2Vec). On one hand this may initially appear surprising, but it’s also essentially an implicit assumption for techniques like similarity-based retrieval, merging, and the key-value similarity scores used by attention. See this blog post by Beren Millidge, this talk from Kiho Park, and perhaps at least skim the paper “Language Models Represent Space and Time” for its figures.
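One way to make the “linear” claim concrete is a linear probe: if a feature is represented linearly, a simple logistic regression on hidden states should separate it, and the probe’s weight vector is a candidate feature direction. The model, layer, and tiny sentiment-style contrast below are assumptions for illustration only, not from the linked paper or talk.

```python
# Minimal linear-probe sketch on last-token hidden states. Model, layer, and
# example sentences are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

sentences = [
    ("The movie was wonderful and I loved it.", 1),
    ("A delightful, heartwarming film.", 1),
    ("The movie was terrible and I hated it.", 0),
    ("A dull, painful waste of time.", 0),
]

layer = 8  # which hidden layer to probe (assumption)
X, y = [], []
with torch.no_grad():
    for text, label in sentences:
        ids = tok(text, return_tensors="pt")
        out = model(**ids, output_hidden_states=True)
        X.append(out.hidden_states[layer][0, -1, :].numpy())  # last-token activation
        y.append(label)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # training accuracy; probe.coef_[0] is the learned direction
```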