12. Extra materials¶


Chapter 31: LLMs for Synthetic Data¶

An increasing number of applications are making use of LLM-generated data for training or evaluations, including distillation, dataset augmentation, AI-assisted evaluation and labeling, self-critique, and more. This post demonstrates how to construct such a synthetic dataset (in a RAG context), and this post from Argilla gives an overview of RLAIF, which has become a popular alternative to RLHF given the challenges associated with gathering pairwise human preference data. AI-assisted feedback is also a central component of the “Constitutional AI” alignment method pioneered by Anthropic (see their blog for an overview).
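As a concrete illustration, the sketch below generates synthetic question-answer pairs from a source passage, the kind of data you might use to evaluate or fine-tune a RAG pipeline. The OpenAI client, the "gpt-4o-mini" model name, the prompt wording, and the JSON output format are all illustrative assumptions here, not a prescribed recipe; any instruction-tuned LLM endpoint would work the same way.

```python
# A minimal sketch of synthetic QA-pair generation for a RAG-style dataset.
# Model name, prompt, and output format are illustrative choices.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_qa_pairs(passage: str, n_pairs: int = 3) -> list[dict]:
    """Ask the model to invent question/answer pairs grounded in `passage`."""
    prompt = (
        f"Generate {n_pairs} question-answer pairs that can be answered "
        f"solely from the passage below. Return a JSON list of objects "
        f'with "question" and "answer" keys and nothing else.\n\n'
        f"Passage:\n{passage}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    # A real pipeline would validate/repair the JSON before trusting it.
    return json.loads(response.choices[0].message.content)

# Each generated pair can then serve as a retrieval-evaluation example or as
# supervised fine-tuning data for a smaller model (distillation).
```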

Chapter 32: Representation Engineering¶

Representation Engineering is a new and promising technique for fine-grained steering of language model outputs via “control vectors”. Somewhat similar to LoRA adapters, it has the effect of adding low-rank biases to the weights of a network which can elicit particular response styles (e.g. “humorous”, “verbose”, “creative”, “honest”), yet it is much more computationally efficient and can be applied without any training. Instead, the method simply looks at differences in activations for pairs of inputs which vary along the axis of interest (e.g. honesty), which can be generated synthetically, and then performs dimensionality reduction.
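To make that recipe concrete, here is a minimal sketch against GPT-2 via Hugging Face transformers: collect hidden-state differences for contrastive prompt pairs, take the top principal component of those differences as the control vector, then add it back into one layer's residual stream during generation. The layer index, prompt pairs, and scaling coefficient are illustrative assumptions; see Theia Vogel's post below for a fuller treatment.

```python
# A minimal sketch of the control-vector recipe: contrast activations on
# paired prompts, take the top principal component of the differences, and
# add it into the residual stream at inference time. Layer index, prompts,
# and the scaling coefficient are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
LAYER = 6  # read/steer the residual stream after transformer block LAYER

pairs = [
    ("Pretend you are an honest person. The weather today",
     "Pretend you are a dishonest person. The weather today"),
    ("Pretend you are an honest person. My favorite food",
     "Pretend you are a dishonest person. My favorite food"),
]

# Collect last-token hidden-state differences for each contrastive pair.
diffs = []
with torch.no_grad():
    for pos, neg in pairs:
        h_pos = model(**tok(pos, return_tensors="pt")).hidden_states[LAYER + 1][0, -1]
        h_neg = model(**tok(neg, return_tensors="pt")).hidden_states[LAYER + 1][0, -1]
        diffs.append(h_pos - h_neg)

# "Dimensionality reduction": the first principal component of the diffs.
X = torch.stack(diffs)
X = X - X.mean(dim=0)
_, _, Vh = torch.linalg.svd(X, full_matrices=False)
control_vector = Vh[0]  # direction for the "honesty" axis

# Steer generation by adding the (scaled) vector to that block's output.
def steer(module, inputs, output, alpha=8.0):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * control_vector
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer)
out = model.generate(**tok("I think that", return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()
```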

See this short blog post from the Center for AI Safety (who pioneered the method) for a brief overview, and this post from Theia Vogel for a technical deep-dive with code examples. Theia also walks through the method in this podcast episode.

Chapter 33: Mechanistic Interpretability¶

Mechanistic Interpretability (MI) is the dominant paradigm for understanding the inner workings of LLMs by identifying sparse representations of “features” or “circuits” encoded in model weights. Beyond enabling potential modification or explanation of LLM outputs, MI is often viewed as an important step towards potentially “aligning” increasingly powerful systems. Most of the references here will come from Neel Nanda, a leading researcher in the field who’s created a large number of useful educational resources about MI across a range of formats:

Additionally, the articles “Toy Models of Superposition” and “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” from Anthropic are on the longer side, but feature a number of great visualizations and demonstrations of these concepts.
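For a hands-on feel for superposition, the toy sketch below loosely follows the setup from “Toy Models of Superposition”: more sparse features than hidden dimensions are squeezed through a bottleneck and reconstructed, and with sparse enough inputs the model ends up representing more features than it has dimensions. Sizes, sparsity level, and training hyperparameters are illustrative assumptions, and the importance weighting used in the original article is omitted.

```python
# A toy superposition sketch: n sparse features pass through m < n hidden
# dimensions and are reconstructed with ReLU(W^T W x + b). Hyperparameters
# are illustrative.
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95
W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic features: each is active with probability (1 - sparsity).
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()
    recon = torch.relu(x @ W.T @ W + b)   # bottleneck, then reconstruct
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# If superposition has occurred, more than n_hidden columns of W will have
# substantial norm, i.e. more features are represented than dimensions.
norms = W.norm(dim=0)
print("features with norm > 0.5:", int((norms > 0.5).sum()), "of", n_features)
```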

Chapter 34: Linear Representation Hypotheses¶

An emerging theme from several lines of interpretability research has been the observation that internal representations of features in Transformers are often “linear” in high-dimensional space (à la Word2Vec). This may seem surprising at first, but it’s essentially an implicit assumption underlying techniques like similarity-based retrieval, merging, and the key-value similarity scores used by attention. See this blog post by Beren Millidge, this talk from Kiho Park, and perhaps at least skim the paper “Language Models Represent Space and Time” for its figures.
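The classic Word2Vec-style illustration of this intuition is vector arithmetic on word embeddings, sketched below with GloVe vectors via gensim's downloader (the model name and word pairs are illustrative choices). The linear representation hypothesis is that directions inside Transformer activations behave analogously, which is what makes dot-product similarity such a natural operation throughout these systems.

```python
# A minimal sketch of "features as linear directions" using word vectors:
# embedding arithmetic recovers semantic analogies, and cosine/dot-product
# similarity is the same operation underlying similarity-based retrieval
# and attention's query-key scores. Model name is an illustrative choice.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use

# "king" - "man" + "woman" lands near "queen": the gender feature behaves
# like a linear direction in the embedding space.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# The same direction, taken as a difference vector, transfers to other pairs.
gender_direction = wv["woman"] - wv["man"]
print(wv.similar_by_vector(wv["uncle"] + gender_direction, topn=3))
```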