Compression Navigator
An LLM is a lossy codec for text. Training compresses a corpus into weights; a forward pass decompresses a continuation. These five tools let you watch that decompression and find where facts physically live.
Each tab is a real interpretability technique: logit lens, embedding neighbours, activation steering, cross-model diff, and causal tracing (ROME).
Three models, on purpose
| name | how it stores facts | what it teaches |
|---|---|---|
| glassbox | key→value writes into the residual stream (like a real transformer / what ROME edits) | the tools work and are verifiable against ground truth you can read in the source |
| handmade | a lookup table keyed on the prompt string (a side channel) | a model can be invisible to residual-stream interpretability, which is a real limitation |
| gpt2 | learned, fuzzy, distributed over many layers | what the real, messy thing looks like |
Suggested order: load glassbox first (to see what "correct" looks like), then handmade (to see a failure mode), then gpt2 (to see reality). Type a name below and press Load.
Logit lens — watch the answer condense, layer by layer
What it does: takes the last-token residual at every layer and reads it through the unembedding, as if the model had to answer right there. You see the prediction form.
How to read it: each row is a layer. Watch your tracked token's probability (right column) climb, and watch entropy (bits) fall as the model commits.
Ground truth to check:
- glassbox: "paris" is ~0 until L3 (the readout right after the fact-MLP), then jumps to ~0.51. Sharp and localised because you put it there.
- handmade: the answer snaps to 1.00 at L1 with zero build-up (it's a lookup, not a computation).
- gpt2: the answer accretes gradually across many middle/late layers. That smear is what "distributed representation" actually looks like.
(Numbering note: the lens counts from the embedding, so L1 is after the first block. The causal-trace tab counts blocks from L0. So the fact-MLP is lens-L3 / trace-block-L2, and its causal site shows at trace-L0.)
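
The readout described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the app's actual code: the three-word vocabulary, the identity unembedding and the per-layer residuals are all made-up toy values, and the final layer norm that a real transformer applies before unembedding is left out. Layer 0 is the embedding, matching the lens numbering.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def logit_lens(residuals, unembed, tracked):
    """Read every layer's last-token residual through the unembedding.

    residuals: one vector per layer (index 0 = embedding output).
    unembed:   one row per vocabulary item, each of length d_model.
    tracked:   vocabulary index of the token to follow.
    Returns a list of (layer, p_tracked, entropy_in_bits) rows.
    """
    rows = []
    for layer, h in enumerate(residuals):
        logits = [sum(w_i * h_i for w_i, h_i in zip(w, h)) for w in unembed]
        probs = softmax(logits)
        rows.append((layer, probs[tracked], entropy_bits(probs)))
    return rows

# Toy data (assumed for illustration only).
VOCAB = ["paris", "london", "rome"]
UNEMBED = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
RESIDUALS = [
    [0.0, 0.0, 0.0],  # L0: embedding, no preference yet
    [0.0, 0.0, 0.0],  # L1
    [0.0, 0.0, 0.0],  # L2
    [4.0, 0.0, 0.0],  # L3: a hypothetical fact write pushes toward "paris"
]

rows = logit_lens(RESIDUALS, UNEMBED, tracked=VOCAB.index("paris"))
for layer, p, h in rows:
    print(f"L{layer}  p(paris)={p:.2f}  entropy={h:.2f} bits")
```

In this toy the tracked probability sits at 1/3 until the write at L3, then jumps while entropy drops, which is the same shape the glassbox row is meant to show.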
Where this goes next
- Closing the loop (what "self-improving" would actually require): right now a human picks every edit; the verifier just grades it. A real closed loop needs a policy that proposes edits on its own (e.g. scanning eval failures for wrong facts), auto-applies, and auto-commits only on a SURGICAL verdict, rolling back otherwise. The hard part — the verifier — already exists here; the proposal step doesn't yet.
- A training-method angle worth taking seriously: instead of accept/reject after the fact, feed the specificity battery's drift score back as a regularizer during the edit computation (closer to elastic weight consolidation, or the null-space projection AlphaEdit-style methods use) so collateral is penalized while solving, not caught after.
- Real-model MEMIT: the edit loop here is exact because the glassbox's fact layer is literally key→value. The same verify harness (efficacy / specificity / fluency + the multi-provider LLM judge) ports straight onto a gpt2/Llama MEMIT edit; the toy is the regression test you run first.
- Multi-hop & paraphrase generalization: add "the currency of france is" so two relations share a subject, and have the LLM judge auto-generate paraphrase probes to test that an edit generalizes, not just memorizes the one prompt.
- Attribution view: Geva-style "what does this neuron write to the vocab", per-head attention attribution.
- It already ships: tab 7 pushes the toy model and this whole app (as a Space) to your Hub, or a real local checkpoint folder to its own repo.
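
As a rough illustration of the closed loop in the first bullet (propose, auto-apply, commit only on a SURGICAL verdict, roll back otherwise), here is a minimal sketch. Everything in it is hypothetical: the model is stood in for by a plain prompt→answer dict, and `propose`, `verify`, `precise_apply`, `sloppy_apply` and the verdict names other than SURGICAL are invented for the example, not taken from the app.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Tuple

Model = Dict[str, str]  # hypothetical stand-in: prompt -> answer

@dataclass(frozen=True)
class Edit:
    prompt: str
    target: str

def propose(model: Model, evals: Dict[str, str]) -> Iterator[Edit]:
    """Policy step: turn every failing eval into a proposed edit."""
    for prompt, expected in evals.items():
        if model.get(prompt) != expected:
            yield Edit(prompt, expected)

def verify(before: Model, after: Model, edit: Edit) -> str:
    """Grade an edit: efficacy on the edited prompt, specificity elsewhere."""
    if after.get(edit.prompt) != edit.target:
        return "FAILED"
    drifted = [p for p in before
               if p != edit.prompt and after.get(p) != before[p]]
    return "SURGICAL" if not drifted else "COLLATERAL"

def precise_apply(model: Model, edit: Edit) -> Model:
    """Changes only the edited prompt."""
    out = dict(model)
    out[edit.prompt] = edit.target
    return out

def sloppy_apply(model: Model, edit: Edit) -> Model:
    """Simulates collateral: rewrites every prompt sharing the old answer."""
    old = model.get(edit.prompt)
    return {p: (edit.target if a == old else a) for p, a in model.items()}

def closed_loop(model: Model, evals: Dict[str, str],
                apply: Callable[[Model, Edit], Model]
                ) -> Tuple[Model, List[Tuple[Edit, str]]]:
    log = []
    for edit in list(propose(model, evals)):
        candidate = apply(model, edit)      # auto-apply on a copy
        verdict = verify(model, candidate, edit)
        if verdict == "SURGICAL":
            model = candidate               # auto-commit
        # any other verdict: the candidate is dropped, i.e. rolled back
        log.append((edit, verdict))
    return model, log

model = {"capital of france": "rome",
         "capital of italy": "rome",
         "capital of spain": "madrid"}
evals = {"capital of france": "paris",
         "capital of italy": "rome",
         "capital of spain": "madrid"}

fixed, fixed_log = closed_loop(model, evals, precise_apply)
kept, kept_log = closed_loop(model, evals, sloppy_apply)
print(fixed_log[0][1], fixed["capital of france"])
print(kept_log[0][1], kept["capital of france"])
```

Applying each edit to a copy keeps rollback trivial: a rejected candidate is simply never assigned back. With a real model the same shape would need a weight snapshot or an invertible update instead of a dict copy.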