Recent weeks have seen a surge of progress in AI interpretability, driven by both Anthropic and OpenAI. In papers such as "Towards Monosemanticity...", "Mapping the Mind...", and "Scaling and Evaluating SAEs", Anthropic and OpenAI researchers have demonstrated techniques for identifying and manipulating the building blocks of AI cognition. By applying methods like Sparse Autoencoders (SAEs) to the activations of language models, they've isolated individual "features" corresponding to human-interpretable concepts, from basic entities to abstract notions.
These advances are more than academic curiosities. As AI systems become increasingly powerful and ubiquitous, understanding and auditing their decision-making processes is becoming a critical challenge. Opaque models making consequential decisions risk perpetuating biases, making errors, and harming those they serve. Interpretability is not just a matter of scientific interest, but of ethical necessity.
However, while this work represents a significant step forward, it also highlights the limitations of past interpretability approaches. Many of the techniques still rely heavily on manual inspection and domain expertise. Identifying circuits in a neural network is a laborious process of hypothesis generation and testing, requiring researchers to painstakingly probe individual neurons and analyze their activation patterns.
This manual approach simply doesn't scale. As models grow to billions and trillions of parameters, encompassing ever more complex domains, the prospect of hand-dissecting their internals becomes increasingly intractable. If interpretability is to keep pace with the explosive growth of AI capabilities, we need methods that can automate the discovery of model semantics, leveraging compute rather than human insight.
This is where Martian's approach makes a difference. Our key insight is that to truly automate the understanding of neural networks, we need to change not just our techniques, but our theoretical foundations. And the foundation we propose is category theory.
Category theory is a branch of mathematics that emphasizes relationships between objects over their internal structure. Rather than defining objects by their elements, category theory defines them by their interactions, expressed through morphisms that preserve certain structures. By mapping objects into different categories and studying the resulting functors, we can uncover deep insights about their nature and behavior, without ever looking inside them.
This is a profound shift in perspective, with far-reaching implications for interpretability. It suggests that many of the tools we need to understand neural networks are already implicit in the webs of relationships between models – relationships based on performance, architecture, training data, and other externally observable properties. By formalizing these relationships as categories and functors, we can extract model semantics in a principled, scalable way.
Sparse Autoencoders, as we show, are a prime example of this category-theoretic approach. On the surface, SAEs might seem like just another technique for compressing model activations. But their real power lies in their ability to disentangle superimposed features.
Superposition is a phenomenon where neural networks pack multiple independent features into single neurons or directions in activation space. This allows models to represent more concepts than they have neurons, but makes those concepts difficult to isolate and interpret. SAEs cut through this complexity by exploiting the sparsity of feature activations. An SAE is trained to reconstruct activations through a sparsity-inducing bottleneck, which tends to align the basis vectors of the bottleneck space with the true underlying features.
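To make superposition concrete, here is a minimal sketch in Python with NumPy. It is a toy illustration, not a description of any real model: the sizes `n_features` and `n_dims`, the random feature directions, and the choice of three active features are all assumptions made for the example. It packs 200 features into 50 dimensions and shows that, when only a few features are active, each one can still be read back approximately despite interference from the others.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 200   # hypothetical number of concepts
n_dims = 50        # dimensions available in the activation space

# One random unit direction per feature. With more features than dimensions,
# the directions cannot all be orthogonal, so they overlap slightly.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only three features are active at once.
active = np.array([3, 57, 120])
feature_values = np.zeros(n_features)
feature_values[active] = 1.0

# The activation vector is the sum (superposition) of the active directions.
activation = feature_values @ directions

# Reading each feature back with a dot product gives a noisy estimate:
# close to 1 for active features, small interference for the rest.
readout = directions @ activation

inactive = np.setdiff1d(np.arange(n_features), active)
print("mean readout, active features:    ", readout[active].mean())
print("mean |readout|, inactive features:", np.abs(readout[inactive]).mean())
```

The readout is only approximate because the directions overlap. Sparsity is what keeps the interference manageable: the more features are active at once, the noisier each readout becomes.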
Crucially, this alignment emerges not from looking at individual neurons, but from the relationships between activation patterns across many inputs. SAEs don't need to be told what the features are; they discover them automatically by preserving the structure of the activation space. This is the essence of the categorical approach: deriving semantics from relations, not elements.
To understand the power and potential of SAEs for model interpretability, we need to look more closely at why they work so effectively. The key insight is that SAEs solve a problem that can be formulated in a category-theoretic way, that is, without any reference to the specific internals of the model being interpreted. This property is what allows SAEs to scale to large, complex models in a way that traditional interpretability methods, which rely on manual dissection of model internals, simply cannot.
To appreciate this, let's first take a closer look at what SAEs are and how they work.
At their core, SAEs are a type of unsupervised learning model that aims to learn efficient, compressed representations of data. They do this by training an encoder network to map input data to a sparse latent space, and a decoder network to reconstruct the original data from this sparse representation.
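The encoder and decoder described above can be sketched in a few lines of NumPy. This is one common formulation, not necessarily the one used in the papers mentioned earlier: the ReLU encoder, the L1 sparsity penalty, the coefficient `l1_coeff`, and all the sizes are assumptions chosen for illustration, and the random matrix `x` stands in for real collected activations.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative latents, then reconstruct them."""
    latents = np.maximum(0.0, x @ W_enc + b_enc)   # ReLU keeps latents non-negative
    reconstruction = latents @ W_dec + b_dec
    return latents, reconstruction

def sae_loss(x, latents, reconstruction, l1_coeff=1e-3):
    """Squared reconstruction error plus an L1 penalty that rewards sparsity."""
    mse = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(latents), axis=1))
    return mse + l1_coeff * l1

rng = np.random.default_rng(0)
d_model, d_latent, batch = 16, 64, 8           # illustrative sizes only
x = rng.normal(size=(batch, d_model))          # stand-in for collected activations
W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
b_enc = np.zeros(d_latent)
b_dec = np.zeros(d_model)

latents, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(latents.shape, reconstruction.shape)
print("loss:", sae_loss(x, latents, reconstruction))
```

Note that the latent space here is wider than the input (64 versus 16). The constraint comes from the sparsity penalty rather than from a narrow layer, which is what lets the latents separate more features than there are input dimensions.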
When applied to the task of interpreting neural networks, SAEs are used to learn sparse representations of a model's hidden activations. The goal is to discover interpretable "features" – directions in the activation space that correspond to human-understandable concepts. A successful SAE will map the polysemantic activations of the original model, where a single neuron might respond to multiple unrelated concepts, to a monosemantic feature space, where each learned feature corresponds to a single, coherent concept.
This is a powerful approach because it allows us to untangle the complex, entangled representations learned by neural networks into a more interpretable form, without losing the essential information and structure of those representations.
Traditional interpretability methods, which involve manually probing and analyzing individual neurons or circuits, have had limited success in this regard. They are able to identify some interpretable neurons, but struggle with the pervasive polysemanticity present in neural networks. Moreover, the manual effort required makes these methods difficult to scale to large models.
SAEs, on the other hand, have shown impressive results in discovering monosemantic features in an automated, scalable way. Because the autoencoder must reconstruct a model's activations through a sparsity-inducing bottleneck, training tends to align the basis vectors of the latent space with the true underlying features of the model.
The key to understanding the scalability of SAEs lies in recognizing their category-theoretic nature.
In essence, an SAE is learning a mapping between two vector spaces – the space of model activations and the latent feature space – in a way that preserves the essential structure and semantics of the activations. Crucially, this mapping is learned purely from the relationships between datapoints in these spaces, without any reference to the specific architecture or weights of the model that generated the activations.
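The point that an SAE needs only the activations, and nothing about the network that produced them, can be illustrated with a small training sketch. Everything here is an assumption made for the example: the function name `train_sae`, the plain gradient descent with hand-written gradients, the hyperparameters, and the synthetic activations built from sparse combinations of random directions. A real setup would use an autodiff framework and activations collected from an actual model.

```python
import numpy as np

def train_sae(activations, d_latent, steps=300, lr=0.05, l1_coeff=1e-3, seed=0):
    """Fit a ReLU sparse autoencoder with plain gradient descent.

    The only input is a matrix of activations (one row per example); nothing
    about the network that produced them is needed.
    """
    rng = np.random.default_rng(seed)
    n, d_model = activations.shape
    W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
    W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
    b_enc = np.zeros(d_latent)
    b_dec = np.zeros(d_model)
    losses = []
    for _ in range(steps):
        # Forward pass.
        pre = activations @ W_enc + b_enc
        z = np.maximum(0.0, pre)
        recon = z @ W_dec + b_dec
        err = recon - activations
        losses.append(np.sum(err ** 2) / n + l1_coeff * np.sum(z) / n)
        # Backward pass (z >= 0, so the L1 subgradient is 1 where z > 0).
        d_recon = 2.0 * err / n
        d_z = d_recon @ W_dec.T + l1_coeff / n
        d_pre = d_z * (pre > 0)
        # Gradient step on all parameters.
        W_dec -= lr * (z.T @ d_recon)
        b_dec -= lr * d_recon.sum(axis=0)
        W_enc -= lr * (activations.T @ d_pre)
        b_enc -= lr * d_pre.sum(axis=0)
    return (W_enc, b_enc, W_dec, b_dec), losses

# Synthetic stand-in for model activations: sparse mixtures of 16 hidden
# feature directions living in an 8-dimensional space.
rng = np.random.default_rng(1)
n_examples, n_features, d_model = 512, 16, 8
true_dirs = rng.normal(size=(n_features, d_model))
true_dirs /= np.linalg.norm(true_dirs, axis=1, keepdims=True)
is_active = rng.random((n_examples, n_features)) < 0.1
strengths = rng.uniform(0.5, 1.5, size=(n_examples, n_features))
activations = (is_active * strengths) @ true_dirs

params, losses = train_sae(activations, d_latent=32)
print(f"loss before: {losses[0]:.3f}, after: {losses[-1]:.3f}")
```

The signature of `train_sae` is the point of the sketch: it accepts a matrix of numbers and a latent width, so the same routine could in principle be pointed at activations from any architecture.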
In category-theoretic terms, we can view the space of model activations and the latent feature space as objects in a category, and the SAE as a structure-preserving morphism between them. The sparsity constraint of the SAE plays a key role here, as it encourages this map to align with the true underlying features that generated the activations.
This formulation is powerful because it abstracts away from the messy details of the model internals, and instead focuses on the intrinsic structure of the data that the model is processing. This is what allows SAEs to be applied to any model architecture, without requiring manual customization or domain knowledge.
Moreover, because SAEs learn purely from data, they can leverage the vast amounts of activations that can be easily collected from models, without requiring manual annotation. This is in contrast to traditional interpretability methods, which rely heavily on human insight and labor to analyze model internals.
In this sense, SAEs exemplify a broader class of interpretability techniques that we might call "categorical interpretability" – techniques that seek to understand models by learning structure-preserving mappings between their representations and human-interpretable spaces, in a way that is agnostic to model internals.
But SAEs are just one instance of a much broader paradigm that we call "model mapping". The core idea is to understand models by mapping them into different mathematical objects, like programs, formulas, or graphs, in a way that preserves certain relational structures. These structures can be based on any externally observable property of models, such as performance metrics, size constraints, or architectural similarities.
By expressing these relational structures as categories and studying the functors between them, we can extract rich, multifaceted descriptions of model behavior, without ever peering inside the black box. And crucially, these descriptions are not just faithful, but also scalable. Because they rely on relationships between models rather than their internal details, they can be learned automatically from data, using the same optimization techniques that power modern AI.
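One very small instance of this idea can be written down directly. A set ordered by a relation such as "is no more accurate than" forms a preorder, and a preorder is a category with at most one morphism between any two objects. A functor between two such categories is exactly an order-preserving (monotone) map. The sketch below checks that property for a toy mapping. The model names, their accuracy and size figures, and the helper `is_monotone` are all hypothetical and exist only to illustrate the definition.

```python
from itertools import product

def is_monotone(items, leq_source, mapping, leq_target):
    """Return True if a <= b in the source implies mapping(a) <= mapping(b)."""
    return all(
        leq_target(mapping(a), mapping(b))
        for a, b in product(items, repeat=2)
        if leq_source(a, b)
    )

# Hypothetical models described only by externally observable numbers.
models = {
    "model_a": {"accuracy": 0.71, "params_millions": 120},
    "model_b": {"accuracy": 0.78, "params_millions": 350},
    "model_c": {"accuracy": 0.83, "params_millions": 900},
}

def no_more_accurate(m1, m2):
    """Source ordering: m1 <= m2 when m1 is no more accurate than m2."""
    return models[m1]["accuracy"] <= models[m2]["accuracy"]

def to_size(m):
    """A candidate mapping from models to a simpler object: their size."""
    return models[m]["params_millions"]

def number_leq(x, y):
    return x <= y

preserves_order = is_monotone(models, no_more_accurate, to_size, number_leq)
print("size mapping preserves the accuracy ordering:", preserves_order)
```

In this toy data the mapping is monotone, so it qualifies as a functor between the two orderings. Negating the sizes would reverse the order and fail the check. Richer versions of model mapping replace the numbers with programs or graphs, but the requirement that relationships be preserved has the same shape.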
Within this categorical approach to interpretability, SAEs suggest a wide range of other possible techniques that share the same philosophy.
Any method that can learn a structure-preserving map from model activations to a structured, interpretable space is a potential candidate. Examples include manifold learning, causal discovery, and matrix factorization.
The unifying theme of these methods is that they all seek to uncover interpretable structure in model representations in a way that is independent of the specific architectural details of the model. They do this by leveraging priors and inductive biases like sparsity, smoothness, independence, or compositionality, which constrain the space of possible mappings to favor interpretable ones.
Importantly, these methods are not just theoretically motivated, but have already shown promising results in practice. Manifold learning has been used to visualize the semantic structure of word embeddings and image representations. Causal discovery has been applied to understand information flow in neural networks. And matrix factorization of activations has been used to group neurons into interpretable components, complementing techniques like feature visualization.
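As an illustration of the matrix factorization idea, here is a minimal sketch of non-negative matrix factorization (NMF) using the classic multiplicative update rules. It is a generic textbook version, not a method taken from any work referenced in this post. The function name `nmf`, the rank, the step count, and the synthetic matrix `V`, which stands in for non-negative activations, are assumptions for the example.

```python
import numpy as np

def nmf(V, rank, steps=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    errors = []
    for _ in range(steps):
        # Multiplicative updates for the squared-error objective; because
        # every factor is non-negative, W and H stay non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

rng = np.random.default_rng(0)
# Stand-in for non-negative activations: 100 examples x 20 neurons,
# generated from 3 hidden non-negative parts.
true_parts = rng.random((3, 20))
weights = rng.random((100, 3))
V = weights @ true_parts

W, H, errors = nmf(V, rank=3)
print(f"reconstruction error: first step {errors[0]:.3f}, last step {errors[-1]:.3f}")
```

Each row of `H` is a candidate "part" expressed over neurons, and each row of `W` says how strongly an example uses each part. The non-negativity constraint plays a role similar to sparsity in an SAE: it narrows the space of factorizations toward ones that tend to be easier to interpret.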
The categorical perspective suggests that these existing techniques are just the tip of the iceberg, and that there is a vast space of possible interpretability methods waiting to be explored, all unified by the common language of category theory.
In a recent paper, we demonstrate the power of this approach by learning functors that map transformer language models to equivalent computer programs. By training on a large dataset of model-program pairs, we show that even simple architectural priors can be enough to induce highly faithful and interpretable mappings, with almost no direct access to model parameters.
The implications are profound. Not only do these mappings provide a transparent, editable window into model behavior, but they also enable a vast range of downstream applications. With models cast as programs, we can apply all the tools of software engineering and formal verification to analyze, optimize, and modify them. We can trace their execution, prove properties of their outputs, and compose them into larger systems with guaranteed behavior.
In effect, model mapping turns the opaque, monolithic structure of neural networks into a modular, manipulable substrate, amenable to the full range of human understanding and control. It's a kind of computational microscope, allowing us to observe the fine-grained details of AI cognition and intervene at the level of individual algorithmic steps.
Of course, realizing the full potential of model mapping will require significant advances in both theory and engineering. Scaling these techniques to the largest models will demand innovations in distributed computing, program synthesis, and automated reasoning, as well as a deeper understanding of the categorical structures underlying AI systems.
But the early results are promising, and the benefits are clear. In a world increasingly shaped by artificial intelligence, the ability to understand and audit these systems is not just a scientific imperative, but a societal one. We need ways to ensure that AI systems are safe, fair, and aligned with human values, even as they grow in complexity and scope.
In the end, the goal of interpretability is not just to understand AI systems, but to shape them. To build models that are not just powerful, but also transparent, reliable, and beneficent.
Model mapping offers a path forward.
If you're interested in learning more about us or collaborating with us, we'd love to hear from you. Please contact us directly at contact@withmartian.com.