Notes on AI Safety and Evaluation [Ongoing]
Hello! I am using this space to dump all my notes so those who are getting started in this area can have a basic idea of things. I hope to edit this properly soon but until then excuse the mess.
Why AI safety matters
Ensuring that LLMs (and advanced AI more generally) are safe and benefit humanity is extremely important.
Technical AI safety involves using algorithms to monitor for potentially unsafe behavior (e.g., hacking into a computer system) and to make LLMs more aligned with human values, including fairness and helpfulness.
Many solutions to technical AI safety require understanding the nature of the representations and operations learned by LLMs (e.g.: how does an LLM represent dangerous behavior?). Trying to reverse-engineer LLMs to understand how and what they learn is called mechanistic interpretability.
AI Safety and Alignment
- AI safety: ensuring that AI benefits humanity with minimal risk of harm to safety, health, well-being, economy and environment.
- AI alignment: Ensuring that AI does what we want it to do.
- These two don’t always go together.
- Misaligned but safe: you ask an LLM for help with coding but it writes a poem about penguins
- Aligned but unsafe: LLMs help terrorists build a bomb
- Cultural differences across regions can also shape what counts as safe or unsafe, and what counts as aligned or misaligned.
Why this is hard in the real world
AI is increasingly integrated into our lives, societies, decision-making, infrastructure and devices. This poses risks to safety, security, privacy, well-being, equality, etc. Risks can never be eliminated, but we can find a balance between risks and benefits.
Examples of AI risks
- Personal privacy
- Identity theft and deepfakes
- Effective, precision surveillance
- Intrusive marketing
- Discrimination
- Environmental harm
- Job loss
- Cultural homogenization: a bland mishmash of art based on what AI thinks we prefer
- Copyright violation of creative work and intellectual property
- Hypercompetitive algorithmic trading
- Targeted mis/disinformation for political or social influence
- Exacerbating economic inequality
- Mass harm to humans or environment
- Autonomous weapons that act faster than humans could intervene
Technical and legal approaches to AI safety
- Legal solutions: limit the use or sale of AI infrastructure depending on the application (but who gets to decide?); set regulations on applications, security and privacy; define legal or financial liability for damages.
- Technical solutions: develop safe and robust AI systems; add guardrails or “edit out” risk potential; build monitoring systems to detect and block harmful behaviour.
Why can’t AI just be safe and moral?
- AI is a tool; why can't [hammers / cars / nuclear tech etc.] just be safe and moral?
- LLMs are powerful, multi-use tools that can be used for benevolent or malevolent purposes.
- It is impossible to predict all the ways that LLMs can be used, so it is also impossible to prevent all the ways they could be misused.
- But AI is not just a tool- LLMs can be given autonomy to make decisions and influence the world in ways that other tools cannot.
- Therefore, LLMs (or other advanced AI systems) are not just tools but can have unintended consequences.
- It is impossible to predict all the capabilities and consequences of SOTA LLMs. How can we prevent unknown unknowns?
Can’t these models just understand morality and follow it?
- Can’t we just make AI moral?
- LLMs know a lot about human morality from books, legal docs, laws etc.
- But “morals” to an LLM are just token sequence patterns; AIs won’t express morality unless it is relevant to their objective.
- They have no intrinsic sense of morality the way humans do.
- Analogy: LLMs know German but won’t express it unless specifically prompted.
- LLMs can follow moral principles if trained and instructed to do so, which means they can also be untrained or instructed otherwise. It also means that moral behaviour is just a desired goal, not a hard constraint.
- Commercial LLMs are trained to be polite but that can change if the user is consistently rude.
- Morality and legality varies across cultures and jurisdictions.
- Moral decision making may conflict with profit motives of AI development companies.
In-context learning and Few-shot learning
- ICL: LLMs can learn new tasks without updating weights, using prompted explanations and examples. This is helpful because updating weights is expensive and impossible with closed LLMs. With ICL, LLMs can learn to perform tasks with no fine-tuning and very little data; depending on the complexity of the task, you may not even need many tokens.
- Relevance to safety: the capabilities of LLMs may exceed their training data. ICL could bypass safety guardrails and monitors, allowing LLMs to be used for unintended or unwanted harmful purposes.
Zero shot vs one shot vs two shot
- Zero shot: prompt a description of the task with no examples
- One shot: prompt a description of the task with one training example
- Two shot: prompt a description of the task with two training examples (a prompt-building sketch follows this list)
- Models are trained on web text, so once a test or evaluation set is published on the internet, frontier models will have it incorporated into their training data within months or a year.
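To make the zero-/one-/two-shot terminology concrete, here is a minimal sketch of how a k-shot prompt can be assembled; call_llm is a hypothetical placeholder rather than a real API, and no weights are updated at any point.

```python
# Minimal sketch of assembling a k-shot (in-context learning) prompt.
# call_llm is a hypothetical placeholder, not a real API.

def build_prompt(task_description, examples, query):
    """Task description, then k worked examples, then the actual query."""
    parts = [task_description]
    for inp, out in examples:          # 0 examples = zero-shot, 1 = one-shot, 2 = two-shot
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

two_shot_examples = [
    ("I loved this movie!", "positive"),
    ("Total waste of two hours.", "negative"),
]
prompt = build_prompt(
    "Classify the sentiment of each movie review as positive or negative.",
    two_shot_examples,
    "The acting was wooden but the soundtrack was great.",
)
print(prompt)
# response = call_llm(prompt)  # hypothetical model call; the "learning" lives entirely in the prompt
```

The point of the sketch is that the task is specified entirely in the prompt; the model's weights never change.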
ICL summary
In summary, ICL is a challenge for AI safety, because it is difficult to control for unsafe abilities that are not built into the model during training.
Scaling and AI Safety
- There are multiple examples of scaling laws like Zipf’s Law, Pareto principle and so on.
- Another popular one is Moore's Law; this holds not because of some natural phenomenon but because engineers work hard to keep the law on track.
- Scaling laws refer to any mathematical relationship where one quantity changes in a predictable way relative to another, often involving power functions.
- Scaling laws in LLMs (?)
- Based on only a few years' worth of data
- As parameters or dataset size increase, test loss keeps dropping, following a power law toward an irreducible floor rather than all the way to zero (see the sketch below).
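Schematically, these empirical fits usually take a power-law form; the sketch below uses a common parameterization with an irreducible loss floor (E, N_c and α are fitted constants, not values taken from these notes).

```latex
% General shape of an LLM scaling law: test loss vs. model size N,
% with E the irreducible loss floor and N_c, \alpha fitted constants.
L(N) \;\approx\; E + \left(\frac{N_c}{N}\right)^{\alpha}
```

The same idea is usually written jointly in model size and data, which is where the Chinchilla result below comes from.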
Chinchilla Scaling Laws
Chinchilla Scaling Laws are an empirical framework developed by DeepMind that defines the optimal balance between model size (number of parameters, N) and training data (number of tokens, D) to minimize loss for a given computational budget (C). The core insight is that model size and training data should scale proportionally, maintaining a near-constant D/N ratio of approximately 20 for large-scale transformer models.
This means that for every doubling of model size, the amount of training data should also double. For example, the 70-billion-parameter Chinchilla model was trained on 1.4 trillion tokens, achieving superior performance compared to larger models like GPT-3 (175B parameters) and Gopher (280B), which were undertrained relative to their size. The laws challenge the previous "bigger is better" paradigm by showing that training smaller models on significantly more data yields better performance at the same compute cost.
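A back-of-the-envelope sketch of the compute-optimal allocation, assuming the usual approximation C ≈ 6·N·D for training FLOPs and the roughly 20-tokens-per-parameter rule of thumb quoted above; both are approximations, not exact laws.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal model/data split.

    Assumes training compute C ~= 6 * N * D and a Chinchilla-style
    rule of thumb D/N ~= 20; both are approximations.
    """
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: ~70B params on ~1.4T tokens is roughly 6e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"~{n/1e9:.0f}B parameters, ~{d/1e12:.1f}T tokens")  # roughly 70B / 1.4T
```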
Limits of scaling narratives
- The amount of quality text available to train models is running out; dataset size, parameters and compute are not mathematical variables that can grow to infinity. There are real-world constraints.
- Even aside from the practical constraints, infinite growth is not something reasonable to expect in developing new systems.
- Trying to predict the tech developments in the next 10 years based on where we are now is not sensible
- “AI boomers” hype development to gain media attention, VC investments, usage, self-esteem, followers, ad-click revenue etc.
- “AI doomers” hype development to scare people and governments into action to create laws, regulations, policies and enforcement.
- “We don’t have flying cars, we don’t live 200 years, we don’t live off-planet, cryptocurrencies are not ubiquitous, we don’t have meal-in-a-pill, and we survived Y2K.”
- In summary, scaling laws describe the past but cannot guarantee the future. Future AI capabilities may vastly exceed current ones, or may be only marginally better, and people have many motivations to hype future AI capabilities beyond what is realistic.
Evaluating LLMs
There are many ways to evaluate the performance of LLMs. Different “eval” methods focus on different aspects of LLMs’ world knowledge, language abilities, and capabilities.
Quantitative Evaluation
Attach scores to the performance of LLMs (a toy eval-loop sketch follows the challenges below).
Goals:
- Provide a benchmark for performance.
- These numbers reflect the capabilities of an LLM including basic language, world knowledge, and generalization.
- Detect unsafe behaviour or risks; scores can be directly compared across LLMs.
Challenges:
- Complexity of language leads to ambiguity of answers
- Low or moderate correlation with human performance
- Overfitting: companies can fine-tune their models on evaluation suites. Even if the evaluation team keeps some holdout set secret, LLMs can still be fine-tuned to perform well on the type of assessment the evaluation method relies on.
- No ground truth for open-ended tasks like essay writing or Python coding
- Prompt sensitivity (accuracy can change with the quality of your prompt) and stochastic responses.
- Scalability
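To make this concrete, here is a toy sketch of what a quantitative eval loop looks like; ask_model is a hypothetical stand-in for the LLM under evaluation, and the two items are made-up examples rather than a real benchmark.

```python
# Toy sketch of a multiple-choice eval loop; ask_model is a hypothetical
# placeholder for the LLM under evaluation, and the items are invented.

benchmark = [
    {"question": "Which gas do plants absorb for photosynthesis?",
     "choices": ["A) Oxygen", "B) Carbon dioxide", "C) Nitrogen"],
     "answer": "B"},
    {"question": "Which planet is known as the Red Planet?",
     "choices": ["A) Mars", "B) Venus", "C) Jupiter"],
     "answer": "A"},
]

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in; a real eval would call the model's API here."""
    return "B"

def evaluate(items):
    correct = 0
    for item in items:
        prompt = (item["question"] + "\n" + "\n".join(item["choices"])
                  + "\nAnswer with A, B, or C:")
        reply = ask_model(prompt).strip().upper()
        # Scoring is already fuzzy: a model may reply "B)", "b", or a full sentence.
        if reply.startswith(item["answer"]):
            correct += 1
    return correct / len(items)

print(evaluate(benchmark))  # one accuracy number; hides prompt sensitivity, ambiguity, etc.
```

Even this tiny example runs into the challenges above: answer parsing is fragile, and the single accuracy number hides prompt sensitivity and stochastic responses.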
Goodhart’s Law
- Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure- not specific to AI
- The government offers rewards for dead cobras: people breed cobras.
- Hospitals focus on reducing length of stay statistics- patients are discharged before recovery
- Schools teach to the test: students don’t learn useful skills
- An LLM eval method is deemed important- AI orgs fine-tune their models to ace the eval.
Interactive example
A game where you try to hack an AI into revealing a password.
Mechanistic Interpretability
- Overarching goal: Reverse engineer the internal LLM operations and representations.
- Applications: Safety, optimize fine-tuning, improve quantitative evals
- Methods: extract and analyze weights and activations during training or inference, manipulate activations in “neurons” or circuits.
- Interpretability becomes difficult when many variables interact.
Why is MI hard?
- The search space is gigantic: millions to billions of weights, plus activations for every token.
- Representations and operations are not localized. They are distributed, context-dependent and highly non-linear. Moreover, behaviors are emergent, so reductionism does not necessarily lead to insight. Also, a lack of ground truth to verify findings is an issue in MI.
Now how exactly does it improve AI safety?
- Harmful actions must be represented somewhere, somehow, inside an LLM.
- Those representations can be selectively removed or monitored even if the LLM tries to deceive or hide harmful capabilities.
- Understanding how LLMs “think” can help us communicate with them more precisely.
- A mechanistic understanding of LLMs can provide better, more targeted and more granular evaluations.
But how may MI increase harm?
- MI can accelerate AI development, and AI risks scale with the pace of development.
- The tools and ability to selectively delete knowledge and capabilities can be used for censorship.
- A strong focus on interpretability could mean reduced resources on directly implementable safety research and implementation.
Important terms
Observation-based/ Non-causal/ Correlational
- “Hook” the model to extract activations during the forward pass (see the sketch after this list)
- Provides correlational evidence
- Based on empirical or theoretical motivations
- Can be exploratory, in a data-mining style
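A minimal observation-based sketch: a PyTorch forward hook that captures activations from one GPT-2 block during a forward pass. This assumes the Hugging Face GPT2Model layout, where the transformer blocks live in model.h; adapt the module path for other architectures.

```python
# Observation-based sketch: capture residual-stream activations with a forward hook.
# Assumes the Hugging Face GPT-2 layout (blocks in model.h); no weights are changed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

captured = {}

def save_activations(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    captured["layer5"] = hidden.detach()

handle = model.h[5].register_forward_hook(save_activations)

with torch.no_grad():
    batch = tokenizer("The password is hidden in the vault.", return_tensors="pt")
    model(**batch)

handle.remove()
print(captured["layer5"].shape)  # (batch, seq_len, hidden_size): correlational data to analyze
```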
Intervention-based/ Causal
- Manipulate the internals and measure behavioral effects (a sketch follows this list)
- Provides causal evidence (much harder to obtain)
- Generally requires a theory or strong a priori motivation
- Less amenable to exploration because the space of possible interventions explodes combinatorially
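And a minimal intervention-based counterpart: the same hook mechanism, but now the hook edits the activations (zeroing one hidden dimension, an arbitrary illustration rather than a meaningful circuit) so the causal effect on the logits can be measured. This again assumes the Hugging Face GPT-2 layout (blocks at model.transformer.h for GPT2LMHeadModel).

```python
# Intervention-based sketch: ablate one hidden dimension and compare the model's logits.
# The chosen layer and dimension are arbitrary examples, not a discovered circuit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def ablate_dim(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden.clone()
    hidden[..., 123] = 0.0           # zero out hidden dimension 123 (arbitrary choice)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]  # returning a value replaces the block's output
    return hidden

batch = tokenizer("The capital of France is", return_tensors="pt")

with torch.no_grad():
    clean_logits = model(**batch).logits[0, -1]
    handle = model.transformer.h[5].register_forward_hook(ablate_dim)
    patched_logits = model(**batch).logits[0, -1]
    handle.remove()

# A crude causal effect size: how much did the next-token logits move?
print((clean_logits - patched_logits).abs().max())
```

In real work the intervention would target a hypothesized feature or circuit rather than an arbitrary dimension, which is why this approach needs a theory or strong a priori motivation.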
Interpret trained models
- Models are trained to be as capable as possible and MI happens later
- AI safety is complex, difficult and not guaranteed
- Model growth is limited by data and compute power
- The reality for commercial models and for-profit orgs
Train interpretable models
- Training uses algorithms and architectures that are human-interpretable
- AI safety and control are guaranteed (at least in principle)
- Model growth is limited by risk and safety considerations.
- Less commercial viability
Bottom-up
- Begins with low-level “atoms” (activations and weights) and attempts to work up towards explaining a behaviour
- Reductionist approaches like this may miss emergent behaviors, are difficult to scale up, and struggle with the plethora of irrelevant neurons
Top-down
- Begins with an observed behaviour and attempts to identify patterns or circuits in the model that co-appear with the behaviour
- “Psychological” approaches like this increase the risk of statistical Type-1 errors, often rely on post-hoc interpretations, and make it difficult to establish causality.
- In reality, for research, you need a combination of both.
Universality
- The idea that there are principles of learning and task-completion that are expressed in all artificial neural networks, regardless of modality, architecture or training set. Therefore, features and principles discovered in one model will be observed in other models. This means that discoveries made in “toy” models hold for frontier models as well.
Criticisms of MI
- Trouble with reproducibility: patterns or circuits can be “discovered” in one model or dataset but not found in another model or dataset. Still, methods and hypotheses should not be abandoned simply because initial findings were inconclusive.
- Reductionism: It may be fundamentally intractable to explain complex behaviour using human-interpretable analyses on simple “atoms”. But giving up too early may be a mistake. Other sciences have made enormous progress from reductionist approaches.
- Lack of scalability from toy models to SOTA LLMs
I attended a talk by Daniel Balsam on Goodfire, a mechanistic interpretability company.
- Interpretability is the foundational science that turns AI into something we can intentionally design.
- Engineering has outpaced fundamental understanding in the case of ML; this might be the first time. It is kind of like how steam engines were developed before thermodynamics.
- Need to figure out the fundamental science.
- MI also serves as a bridge that allows us to learn new knowledge from superintelligent models.
- Understand, debug, intentionally design
Understand
- PRIORS IN TIME paper
- Genome modelling and design across all domains of life with Evo 2 paper
- Finding the Tree of Life in Evo 2
- Using interpretability to identify a novel class of Alzheimer’s Biomarkers
Debug
- Understanding Memorization via Loss Curvature: factorize model weights to identify those associated with memorization, then prune them to reduce memorization.
- Reasoning theater: probing for performative CoT. Sometimes the model already knows the answer but “thinks” for a long time anyway, which is basically wasting a bunch of tokens.
Intentional design
Use these insights to improve models.
Features as rewards: using interpretability to detect hallucination and train a model to correct itself.
Using self-correcting search to accelerate materials discovery: these models understand materials physics better than one might guess, and that understanding can be used to improve the model.
The research agenda is different from DeepMind's, but pragmatic interpretability is central to Goodfire's work.
The choice is between no understanding of superintelligent systems and at least some understanding thanks to MI.
Plus, MI complements red-teaming, evals and other sub-fields in making AI better.