Skip to content

Goals & Motivation

Why Explore LLM Internals?

Language models are powerful tools, but using them effectively - and understanding their limitations - requires going beyond treating them as black boxes. This project aims to build intuition for what's actually happening inside these systems.

Areas of Exploration

Architecture Fundamentals

Understanding the building blocks that make language models work:

  • Transformer layers - How self-attention and feed-forward networks combine
  • Attention mechanisms - What patterns different attention heads learn
  • Residual streams - How information flows through skip connections
  • MLPs - What the feed-forward layers compute and represent
  • Positional encodings - How models track sequence position

Representations

How information from training corpora gets encoded:

  • Weight matrices - What linguistic patterns are captured in parameters
  • Activation patterns - How inputs are transformed through layers
  • Embedding spaces - Geometric structure of token representations
  • Feature formation - How interpretable features emerge from training

Interpretability Techniques

Methods for understanding what models have learned:

  • Probing classifiers - Testing what information activations contain
  • Sparse autoencoders - Decomposing activations into interpretable features
  • Activation patching - Identifying causal relationships between components
  • Steering vectors - Controlling model behavior through activation edits
  • Mechanistic interpretability - Reverse-engineering learned circuits

Fine-Tuning & Adaptation

What actually changes during training:

  • LoRA - Low-rank adaptation for efficient fine-tuning
  • Adapters - Modular components for task-specific behavior
  • Weight changes - Tracking what parameters shift during training
  • Catastrophic forgetting - Understanding and mitigating knowledge loss

Learning Approach

Experiment-Driven

Rather than just reading papers, we:

  1. Implement from scratch - Build neural networks to understand fundamentals
  2. Visualize everything - Create tools to see what's happening
  3. Run experiments - Test hypotheses with actual code
  4. Iterate rapidly - Use TDD to build reliable tools quickly

Scale Appropriately

Start simple and build up:

  • XOR neural network → Understand backprop and training dynamics
  • Small transformers → Learn attention and layer interactions
  • GPT-2 → Explore a real language model architecture
  • Interpretability tools → Probe and understand larger models

Document Learnings

Capture insights along the way:

  • What worked - Successful experiments and approaches
  • What didn't - Failed hypotheses and dead ends
  • Surprises - Unexpected findings and confusing results
  • Open questions - Mysteries still to investigate

Success Metrics

This project is successful if it leads to:

  • Intuitive understanding - Can I explain how transformers work to someone else?
  • Practical tools - Do the visualizations and analysis tools aid learning?
  • Testable predictions - Can I predict model behavior before running experiments?
  • Reusable knowledge - Do the patterns and insights transfer to new problems?

Why Public?

Learning in public serves multiple purposes:

  1. Accountability - Documenting progress creates motivation to continue
  2. Clarity - Explaining concepts forces deeper understanding
  3. Feedback - Others can point out mistakes or suggest improvements
  4. Shared benefit - These notes might help others on similar journeys

Non-Goals

This project explicitly is NOT:

  • Production-ready tools - Code prioritizes clarity over performance
  • Novel research - We're learning established concepts, not inventing new ones
  • Comprehensive coverage - Deep understanding of selected topics beats shallow breadth
  • Competitive ML - No leaderboards, benchmarks, or SOTA comparisons

The focus is understanding, not publication or production use.

Next Steps

With these goals in mind, let's see what we've built so far →

A hands-on learning project exploring LLM internals