Blackbox Model Provenance via Palimpsestic Membership Inference Paper • 2510.19796 • Published Oct 22 • 3
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders Paper • 2501.17148 • Published Jan 28 • 1
Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors Paper • 2505.11770 • Published May 17 • 2
TOFU Unlearned Models Collection • Collection of Phi TOFU models with various configurations • 17 items • Updated Oct 8, 2024 • 6
Rigorously Assessing Natural Language Explanations of Neurons Paper • 2309.10312 • Published Sep 19, 2023
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments Paper • 2401.12631 • Published Jan 23, 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions Paper • 2403.07809 • Published Mar 12, 2024 • 1
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Paper • 2402.17700 • Published Feb 27, 2024 • 2