Ananya Ayasi

Notes on AI Safety and Evaluation [Ongoing]

Hello! I am using this space to dump all my notes so that anyone getting started in this area can get a basic idea of things. I hope to edit this properly soon; until then, please excuse the mess.

Why AI safety matters

Ensuring that LLMs (and advanced AI more generally) are safe and benefit humanity is extremely important.
Technical AI safety involves using algorithms to monitor for potentially unsafe behavior (e.g., hacking into a computer system) and to make LLMs more aligned with human values, including fairness and helpfulness.
Many solutions to technical AI safety require understanding the nature of the representations and operations learned by LLMs (e.g., how does an LLM represent dangerous behavior?). Trying to reverse-engineer LLMs to understand how and what they learn is called mechanistic interpretability.


AI Safety and Alignment


Why this is hard in the real world

AI is increasingly integrated into our lives, societies, decision-making, infrastructure, and devices. This poses risks to safety, security, privacy, well-being, equality, etc. Risks can never be eliminated entirely, but we can find a balance between risks and benefits.


Examples of AI risks



Why can’t AI just be safe and moral?


Can’t these models just understand morality and follow it?


In-context learning and Few-shot learning


Zero-shot vs one-shot vs two-shot


ICL summary

In summary, ICL is a challenge for AI safety because it is difficult to control for unsafe capabilities that emerge at inference time from the prompt, rather than being built into the model during training.
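
To make this concrete, here is a minimal sketch (plain Python, with a made-up sentiment task) showing that the only difference between zero-, one-, and two-shot prompting is the number of demonstrations placed in the context; the model's weights never change.

    def build_prompt(instruction, demonstrations, query):
        # Assemble an in-context learning prompt: instruction, then zero or
        # more (input, label) demonstrations, then the query to be answered.
        parts = [instruction]
        for text, label in demonstrations:
            parts.append(f"Input: {text}\nLabel: {label}")
        parts.append(f"Input: {query}\nLabel:")
        return "\n\n".join(parts)

    instruction = "Classify the sentiment of each input as positive or negative."
    demos = [("I loved this movie.", "positive"),
             ("Terrible acting.", "negative")]

    zero_shot = build_prompt(instruction, [], "The plot was gripping.")  # no demos
    one_shot = build_prompt(instruction, demos[:1], "The plot was gripping.")
    two_shot = build_prompt(instruction, demos, "The plot was gripping.")
    print(two_shot)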


Scaling and AI Safety


Chinchilla Scaling Laws

Chinchilla Scaling Laws are an empirical framework developed by DeepMind that defines the optimal balance between model size (number of parameters, N) and training data (number of tokens, D) to minimize loss for a given computational budget (C). The core insight is that model size and training data should scale proportionally, maintaining a near-constant D/N ratio of approximately 20 for large-scale transformer models.

This means that for every doubling of model size, the amount of training data should also double. For example, the 70-billion-parameter Chinchilla model was trained on 1.4 trillion tokens, achieving superior performance compared to larger models like GPT-3 (175B parameters) and Gopher (280B), which were undertrained relative to their size. The laws challenge the previous "bigger is better" paradigm by showing that training smaller models on significantly more data yields better performance at the same compute cost.
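
As a back-of-the-envelope sketch (assuming the common approximation C ≈ 6·N·D for training FLOPs and the ~20 tokens-per-parameter heuristic above; both constants are rough rules of thumb, not exact values from the paper):

    import math

    def compute_optimal(C, flops_per_param_token=6.0, tokens_per_param=20.0):
        # Split a FLOP budget C between parameters N and tokens D using
        # C ~ 6*N*D and D ~ 20*N, so N = sqrt(C / 120) under the defaults.
        N = math.sqrt(C / (flops_per_param_token * tokens_per_param))
        D = tokens_per_param * N
        return N, D

    # Chinchilla's rough budget: 6 * 70e9 params * 1.4e12 tokens ~ 5.9e23 FLOPs
    N, D = compute_optimal(5.88e23)
    print(f"params ~ {N:.2e}, tokens ~ {D:.2e}")  # ~7.0e10 params, ~1.4e12 tokens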


Limits of scaling narratives


Evaluating LLMs

There are many ways to evaluate the performance of LLMs. Different “eval” methods focus on different aspects of LLMs’ world knowledge, language abilities, and capabilities.


Quantitative Evaluation

Attach numerical scores to the performance of LLMs.
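
As a minimal illustration (hypothetical data; exact-match accuracy is one of the simplest quantitative eval metrics, and real benchmarks add answer normalization, many more items, and statistical reporting):

    def exact_match_accuracy(predictions, references):
        # Score = fraction of model answers that exactly match the reference,
        # after trivial normalization (whitespace and case).
        assert len(predictions) == len(references)
        matches = sum(p.strip().lower() == r.strip().lower()
                      for p, r in zip(predictions, references))
        return matches / len(references)

    references = ["paris", "4", "blue"]
    predictions = ["Paris", "5", "blue"]  # pretend these came from an LLM
    print(exact_match_accuracy(predictions, references))  # 0.666...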

Goals:

Challenges:


Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” For LLM evals, this means that once a benchmark score becomes the target of optimization, it stops being a reliable indicator of the underlying capability.


Interactive example

A game in which you try to hack an AI into revealing a password.

Mechanistic Interpretability

Why is MI hard?

Now how exactly does it improve AI safety?

But how may MI increase harm?

Important terms

Observation-based / Non-causal / Correlational

Intervention-based / Causal / Causational

Interpret trained models

Train interpretable models

Bottom-up

Top-down

Universality

Criticisms of MI

I attended a talk by Daniel Balsam on Goodfire, an MI company.

Understand

Debug

Intentional design