Compression Navigator

An LLM is a lossy codec for text. Training compresses a corpus into weights; a forward pass decompresses a continuation. These five tools let you watch that decompression and find where facts physically live.

Each tab is a real interpretability technique: logit lens, embedding neighbours, activation steering, cross-model diff, and causal tracing (ROME).

Three models, on purpose

name | how it stores facts | what it teaches
glassbox | key→value writes into the residual stream (like a real transformer; this is what ROME edits) | the tools work and are verifiable against ground truth you can read in the source
handmade | a lookup table keyed on the prompt string (a side channel) | a model can be invisible to residual-stream interpretability, which is a real limitation
gpt2 | learned, fuzzy, distributed over many layers | what the real, messy thing looks like
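
To make the glassbox/handmade contrast concrete, here is a minimal sketch of a key→value write into a residual vector next to a prompt-string lookup. Every name, dimension, and number is hypothetical and chosen for illustration (including the prompt string); this is not the app's actual source.

```python
# Toy key->value memory writing into a residual stream (illustrative only).
# All names and numbers are hypothetical, not taken from the app's source.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fact_mlp(residual, keys, values):
    """Return the residual after every matching key has added its value vector."""
    out = list(residual)
    for key, value in zip(keys, values):
        gate = max(0.0, dot(key, residual))  # ReLU: fires only when the key matches
        out = [o + gate * v for o, v in zip(out, value)]
    return out

# One stored fact: the "france" direction (dim 0) writes the "paris" direction (dim 2).
keys = [[1.0, 0.0, 0.0]]
values = [[0.0, 0.0, 1.0]]

france = [1.0, 0.0, 0.0]
other = [0.0, 1.0, 0.0]

after_france = fact_mlp(france, keys, values)  # the "paris" direction is now present
after_other = fact_mlp(other, keys, values)    # unchanged: the key did not match

# A handmade-style lookup keyed on the prompt string bypasses the residual entirely,
# so residual-stream tools have nothing to read.
lookup = {"the capital of france is": "paris"}
```

In the first case the fact is visible as a change in the residual that a lens or a trace can pick up; in the second the answer arrives without any intermediate state to inspect.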

Suggested order: load glassbox first (see "correct"), then handmade (see a failure mode), then gpt2 (see reality). Type a name below and click Load.

Logit lens — watch the answer condense, layer by layer

What it does: takes the last-token residual at every layer and reads it through the unembedding, as if the model had to answer right there. You see the prediction form.

How to read it: each row is a layer. Watch your tracked token's probability (right column) climb, and watch entropy (bits) fall as the model commits.
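
The readout described above can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the vocabulary, the unembedding matrix, and the per-layer residuals are made up, and the final layer norm that a real model typically applies before unembedding is omitted.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def logit_lens(residuals, unembed, tracked):
    """For each layer's last-token residual, return (layer, p(tracked), entropy in bits)."""
    rows = []
    for layer, h in enumerate(residuals):
        # Read the residual through the unembedding as if this were the final layer.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs = softmax(logits)
        rows.append((layer, probs[tracked], entropy_bits(probs)))
    return rows

# Hypothetical 3-token vocab and 2-dim residual stream.
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per vocab token
residuals = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]]   # L0 (embedding), L1, L2

rows = logit_lens(residuals, unembed, tracked=0)
for layer, p, h in rows:
    print(f"L{layer}  p={p:.2f}  entropy={h:.2f} bits")
```

In this toy the tracked token's probability rises and the entropy falls from one layer to the next, which is the pattern to look for in the tab.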

Ground truth to check:

  • glassbox: paris is ~0 until L3 (the readout right after the fact-MLP), then jumps to ~0.51. Sharp and localised because you put it there.
  • handmade: the answer snaps to 1.00 at L1 with zero build-up (it's a lookup, not a computation).
  • gpt2: the answer accretes gradually across many middle/late layers. That smear is what "distributed representation" actually looks like.

(Numbering note: the lens counts from the embedding, so L1 is after the first block. The causal-trace tab counts blocks from L0. So the fact-MLP is lens-L3 / trace-block-L2, and its causal site shows at trace-L0.)
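
The off-by-one between the two tabs can be written as a tiny helper. The function name is hypothetical, and it covers only the lens-to-block conversion stated in the note.

```python
def lens_to_trace_block(lens_layer):
    """Map a logit-lens row to the trace-tab block whose output that row reads."""
    if lens_layer < 1:
        # Lens L0 is the raw embedding: no block has run yet.
        raise ValueError("lens L0 is the embedding; no block corresponds to it")
    return lens_layer - 1

fact_mlp_block = lens_to_trace_block(3)  # the fact-MLP: lens-L3 / trace-block-L2
```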


Where this goes next

  • Closing the loop (what "self-improving" would actually require): right now a human picks every edit; the verifier just grades it. A real closed loop needs a policy that proposes edits on its own (e.g. scanning eval failures for wrong facts), auto-applies, and auto-commits only on a SURGICAL verdict, rolling back otherwise. The hard part — the verifier — already exists here; the proposal step doesn't yet.
  • A training-method angle worth taking seriously: instead of accept/reject after the fact, feed the specificity battery's drift score back as a regularizer during the edit computation (closer to elastic weight consolidation, or the null-space projection AlphaEdit-style methods use) so collateral is penalized while solving, not caught after.
  • Real-model MEMIT: the edit loop here is exact because the glassbox's fact layer is literally key→value. The same verify harness (efficacy / specificity / fluency + the multi-provider LLM judge) ports straight onto a gpt2/Llama MEMIT edit; the toy is the regression test you run first.
  • Multi-hop & paraphrase generalization: add "the currency of france is" so two relations share a subject, and have the LLM judge auto-generate paraphrase probes to test that an edit generalizes, not just memorizes the one prompt.
  • Attribution view: add Geva-style "what does this neuron write to the vocab" readouts and per-head attention attribution.
  • It already ships: tab 7 pushes the toy model and this whole app (as a Space) to your Hub, or a real local checkpoint folder to its own repo.
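
The closed loop from the first bullet (propose, apply, verify, commit only on SURGICAL, roll back otherwise) might look roughly like the sketch below. The model here is a plain dict of facts, and every function name, the "COLLATERAL" verdict label, and the sample facts are hypothetical stand-ins; only the SURGICAL verdict comes from the text above.

```python
import copy

def closed_loop_step(model, propose, apply_edit, verify):
    """Propose one edit, apply it to a copy, and keep it only on a SURGICAL verdict."""
    edit = propose(model)
    if edit is None:
        return model, "NO_PROPOSAL"
    candidate = apply_edit(copy.deepcopy(model), edit)
    verdict = verify(model, candidate, edit)
    if verdict == "SURGICAL":
        return candidate, verdict  # commit
    return model, verdict          # roll back: the original was never mutated

# Hypothetical toy setup: the model is a subject -> answer dict with one wrong fact.
model = {"france": "rome", "japan": "tokyo"}
expected = {"france": "paris"}

def propose(m):
    # Scan eval failures for a wrong fact and propose the correction.
    for subject, target in expected.items():
        if m.get(subject) != target:
            return (subject, target)
    return None

def apply_edit(m, edit):
    m[edit[0]] = edit[1]
    return m

def sloppy_apply(m, edit):
    # Deliberately bad edit: overwrites every fact, not just the target.
    for subject in m:
        m[subject] = edit[1]
    return m

def verify(before, after, edit):
    subject, target = edit
    efficacy = after.get(subject) == target
    specificity = all(after[k] == before[k] for k in before if k != subject)
    return "SURGICAL" if efficacy and specificity else "COLLATERAL"

committed, verdict_ok = closed_loop_step(model, propose, apply_edit, verify)
kept, verdict_bad = closed_loop_step(model, propose, sloppy_apply, verify)
```

Editing a deep copy keeps rollback trivial: a rejected candidate is simply discarded. For a real checkpoint a full copy may be too expensive, and saving only the edited weights would be the more likely choice.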