The opaque inner workings of AI systems are a barrier to widespread deployment. Now, startup Anthropic has made a breakthrough in the ability to peer inside artificial minds.
One of the biggest advantages of deep learning neural networks is that they can, in a sense, think for themselves. Unlike previous generations of AI, which were painstakingly hand-coded by humans, these algorithms are trained on large amounts of data to come up with their own solutions to problems.
This makes them less brittle and easier to scale up to hard problems, but it also means there is little insight into how they reach their decisions. That makes it difficult to understand or predict errors, or to identify where bias may be seeping into results.
This lack of transparency limits the deployment of such systems in sensitive fields like medicine, law enforcement, and insurance. More speculatively, it also raises concerns about whether we would be able to detect risky behavior, such as deception or power seeking, in more powerful future AI models.
But now the Anthropic team has made a significant advance in our ability to analyze what happens inside these models. They showed not only that specific patterns of activity in a large language model can be linked to both concrete and abstract concepts, but also that the model’s behavior can be controlled by dialing this activity up or down.
The study builds on years of research into “mechanistic interpretability,” in which researchers reverse engineer neural networks to understand how the activity of a model’s various neurons dictates its behavior.
This is easier said than done, because the latest generation of AI models encode information in patterns of activity spread across many neurons, rather than in specific neurons or groups of neurons. That means individual neurons can be involved in representing many different concepts.
Researchers had previously shown that they could extract activity patterns known as features from relatively small models and link them to human-interpretable concepts. This time, the team decided to analyze Anthropic’s Claude 3 Sonnet large language model to show the approach could work on a commercially useful AI system.
They trained another neural network on activation data from one of Sonnet’s inner layers and were able to extract roughly 10 million unique features related to everything from people and places to abstract ideas like gender bias or secrecy.
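Anthropic has described this second network as a sparse autoencoder, which learns to re-express a layer’s activation vectors as sparse combinations of learned feature directions. The sketch below is only a toy illustration of that idea; the dimensions, sparsity penalty, and names are placeholders, not Anthropic’s actual setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder: re-expresses a model layer's activation
    vectors as sparse combinations of learned feature directions."""

    def __init__(self, activation_dim=512, num_features=16384):
        super().__init__()
        # Real runs use the model's true activation width and millions of features.
        self.encoder = nn.Linear(activation_dim, num_features)
        self.decoder = nn.Linear(num_features, activation_dim, bias=False)

    def forward(self, activations):
        # ReLU zeroes negative pre-activations; together with the L1 penalty
        # below, only a handful of features stay active for any given input.
        features = F.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(reconstruction, activations, features, sparsity_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that drives most feature
    # activations to zero; that sparsity is what makes features legible.
    mse = F.mse_loss(reconstruction, activations)
    l1 = features.abs().sum(dim=-1).mean()
    return mse + sparsity_coeff * l1

# Hypothetical usage on a batch of stand-in activations:
sae = SparseAutoencoder()
acts = torch.randn(32, 512)        # in practice: real activations from the model
recon, feats = sae(acts)
sae_loss(recon, acts, feats).backward()
```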
Interestingly, they found that features for similar concepts clustered together, with significant overlap in the neurons whose activity encoded them. The team says this suggests that the way ideas are organized inside these models corresponds, at least roughly, to our own notions of similarity.
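One rough way to make that clustering concrete is to compare features by the direction each one writes back into the model: features whose decoder vectors point in similar directions tend to represent related concepts. A hypothetical helper, reusing the toy sae sketched above:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def nearest_features(sae, feature_idx, top_k=5):
    """Finds the features whose decoder directions are most similar (by cosine
    similarity) to a given feature, i.e. its conceptual 'neighbors'."""
    directions = sae.decoder.weight.T              # (num_features, activation_dim)
    query = directions[feature_idx].unsqueeze(0)
    sims = F.cosine_similarity(directions, query, dim=1)
    sims[feature_idx] = -1.0                       # exclude the feature itself
    return torch.topk(sims, top_k)

# e.g. nearest_features(sae, feature_idx=42) might surface other features
# about the same place, person, or theme once the autoencoder is trained.
```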
But more pertinently, the researchers also found that dialing the activity of the neurons encoding these features up or down could have a significant impact on the model’s behavior. For example, massively amplifying the feature for the Golden Gate Bridge forced the model to bring the bridge into every response, no matter how irrelevant, and even to claim that it was itself the iconic landmark.
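Mechanically, “turning a feature up” can be implemented by adding a scaled copy of its direction to the model’s internal activations while it generates text. Here is a minimal sketch of that kind of activation steering; the layer index, feature index, and scale are placeholders rather than Anthropic’s actual values.

```python
import torch

def make_steering_hook(feature_direction, scale=8.0):
    """Returns a forward hook that adds a scaled feature direction to a
    layer's output, nudging the model toward that concept as it generates."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * feature_direction.to(
            device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Hypothetical usage with a Hugging Face-style transformer (names are placeholders):
# direction = sae.decoder.weight[:, feature_idx]   # the feature's decoder vector
# handle = model.model.layers[20].register_forward_hook(make_steering_hook(direction))
# ...generate text with the hook in place, then handle.remove() to restore behavior.
```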
The team also experimented with some more nefarious manipulations. In one, they found that over-activating a feature linked to scam emails could get the model to bypass its restrictions and write one of its own. In another, amplifying features related to flattery pushed the model into using sycophancy as a means of deception.
The team says there is little risk of attackers using this approach to get models to produce unwanted or dangerous output, because there are already much simpler ways to achieve the same goal. But it could be a useful way to monitor a model for worrying behavior. Dialing the activity of different features up or down could also be a way to nudge models toward desirable outputs and away from less positive ones.
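Monitoring is the flip side of steering: rather than injecting a feature, you watch how strongly worrying features fire on incoming activations and flag anything above a threshold. A hedged sketch, again reusing the toy sae, with made-up labels, indices, and threshold:

```python
import torch

@torch.no_grad()
def flag_worrying_features(activations, sae, watchlist, threshold=5.0):
    """Scores a batch of layer activations against a watchlist of feature
    indices (labels and indices here are purely illustrative) and returns
    which inputs exceed the activation threshold for each label."""
    _, features = sae(activations)                 # (batch, num_features)
    alerts = {}
    for label, idx in watchlist.items():
        hits = (features[:, idx] > threshold).nonzero(as_tuple=True)[0]
        if len(hits) > 0:
            alerts[label] = hits.tolist()
    return alerts

# Hypothetical usage (feature indices would come from interpretability analysis):
# alerts = flag_worrying_features(layer_acts, sae,
#                                 {"deception": 123456, "scam_email": 789012})
```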
However, the researchers were keen to point out that the features they discovered represent just a small fraction of all those contained within the model. What’s more, extracting every feature would take far more computing power than was used to train the model in the first place.
This means we still have a long way to go before we have a complete picture of how these models “think.” Nonetheless, the research shows that it is possible, at least in principle, to make these black boxes slightly less inscrutable.
Image credit: mohammed idris djoudi / Unsplash