Goals & Motivation
Why Explore LLM Internals?
Language models are powerful tools, but using them effectively - and understanding their limitations - requires going beyond treating them as black boxes. This project aims to build intuition for what's actually happening inside these systems.
Areas of Exploration
Architecture Fundamentals
Understanding the building blocks that make language models work:
- Transformer layers - How self-attention and feed-forward networks combine (see the attention sketch after this list)
- Attention mechanisms - What patterns different attention heads learn
- Residual streams - How information flows through skip connections
- MLPs - What the feed-forward layers compute and represent
- Positional encodings - How models track sequence position
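To make the first of these concrete, here's a minimal single-head self-attention sketch in NumPy. It's illustrative only: the weight matrices are random stand-ins, and real models add multiple heads, masking, output projections, and layer norms.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product similarities
    weights = softmax(scores, axis=-1)       # each row is a distribution over positions
    return weights @ v                       # mix value vectors by attention weight

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```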
Representations
How information from training corpora gets encoded:
- Weight matrices - What linguistic patterns are captured in parameters
- Activation patterns - How inputs are transformed through layers (see the hook sketch after this list)
- Embedding spaces - Geometric structure of token representations
- Feature formation - How interpretable features emerge from training
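As a small sketch of inspecting activation patterns, the snippet below uses PyTorch forward hooks to capture per-layer outputs from a toy MLP. The model is a stand-in; the hook pattern is what transfers to real transformers.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # stash this layer's output
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(make_hook(name))

_ = model(torch.randn(2, 8))  # one forward pass fills the dict
for name, act in activations.items():
    print(name, tuple(act.shape), f"mean={act.mean().item():.3f}")
```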
Interpretability Techniques
Methods for understanding what models have learned:
- Probing classifiers - Testing what information activations contain (see the probe sketch after this list)
- Sparse autoencoders - Decomposing activations into interpretable features
- Activation patching - Identifying causal relationships between components
- Steering vectors - Controlling model behavior through activation edits
- Mechanistic interpretability - Reverse-engineering learned circuits
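As a taste of the first technique, here's a minimal linear-probe sketch using scikit-learn. The "activations" here are synthetic vectors with a planted, linearly decodable property; a real probe would train on activations captured from a model layer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
acts = rng.normal(size=(500, 64))                   # stand-in for layer activations
labels = (acts[:, 0] + acts[:, 3] > 0).astype(int)  # property planted in two dimensions

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2f}")  # well above the 0.5 chance rate
```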
Fine-Tuning & Adaptation
What actually changes during training:
- LoRA - Low-rank adaptation for efficient fine-tuning (see the sketch after this list)
- Adapters - Modular components for task-specific behavior
- Weight changes - Tracking what parameters shift during training
- Catastrophic forgetting - Understanding and mitigating knowledge loss
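To ground the LoRA item, here's a minimal sketch of the core idea: freeze the base weight and learn a low-rank update B @ A, scaled by alpha/r. The names r and alpha follow the LoRA paper; this is a teaching sketch, not the peft library's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=4, alpha=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)        # freeze the base weight matrix
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # base output plus the scaled low-rank update (x -> r dims -> d_out dims)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(16, 16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # just A, B, and the (unfrozen) base bias
```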
Learning Approach
Experiment-Driven
Rather than just reading papers, we:
- Implement from scratch - Build neural networks to understand fundamentals
- Visualize everything - Create tools to see what's happening
- Run experiments - Test hypotheses with actual code
- Iterate rapidly - Use TDD to build reliable tools quickly (a sample test follows)
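In that spirit, even throwaway sketches get tests. Here's a minimal pytest-style example pinning down a property any softmax attention matrix must satisfy:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def test_attention_weights_form_a_distribution():
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(4, 4))  # any score matrix will do
    weights = softmax(scores, axis=-1)
    assert np.allclose(weights.sum(axis=-1), 1.0)  # rows sum to 1
    assert (weights >= 0).all()                    # no negative attention
```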
Scale Appropriately
Start simple and build up:
- XOR neural network → Understand backprop and training dynamics (see the sketch after this list)
- Small transformers → Learn attention and layer interactions
- GPT-2 → Explore a real language model architecture
- Interpretability tools → Probe and understand larger models
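As a taste of the first rung, here's the XOR network in plain NumPy with hand-written backprop. Hyperparameters are illustrative, and convergence depends on the seed.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)  # a 2-4-1 network
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
lr = 0.5

for step in range(5000):
    h = sigmoid(X @ W1 + b1)             # forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)           # forward pass, output
    d_out = (out - y) * out * (1 - out)  # backprop through MSE loss and sigmoid
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```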
Document Learnings
Capture insights along the way:
- What worked - Successful experiments and approaches
- What didn't - Failed hypotheses and dead ends
- Surprises - Unexpected findings and confusing results
- Open questions - Mysteries still to investigate
Success Metrics
This project is successful if it leads to:
- Intuitive understanding - Can I explain how transformers work to someone else?
- Practical tools - Do the visualizations and analysis tools aid learning?
- Testable predictions - Can I predict model behavior before running experiments?
- Reusable knowledge - Do the patterns and insights transfer to new problems?
Why Public?
Learning in public serves multiple purposes:
- Accountability - Documenting progress creates motivation to continue
- Clarity - Explaining concepts forces deeper understanding
- Feedback - Others can point out mistakes or suggest improvements
- Shared benefit - These notes might help others on similar journeys
Non-Goals
This project is explicitly NOT:
- Production-ready tools - Code prioritizes clarity over performance
- Novel research - We're learning established concepts, not inventing new ones
- Comprehensive coverage - Deep understanding of selected topics beats shallow breadth
- Competitive ML - No leaderboards, benchmarks, or SOTA comparisons
The focus is understanding, not publication or production use.
Next Steps
With these goals in mind, let's see what we've built so far →