- UCR researchers retrain AI models so their safety stays intact when they are slimmed down for smaller devices
- Changing the exit layer removes protections; retraining restores the blocking of unsafe responses
- In a study with LLaVA 1.5, the reduced model refused dangerous prompts after retraining
Researchers at the University of California, Riverside, have tackled the problem of weakened safety in open-source artificial intelligence models when they are adapted for smaller devices.
As these systems are trimmed to run efficiently on phones, cars, or other low-power hardware, they can lose the safeguards designed to keep them from producing offensive or dangerous material.
The UCR team examined what happens when a model's exit layer is moved from its default position.
Weakened safety guardrails
Their results, presented at the International Conference on Machine Learning in Vancouver, Canada, showed that safety guardrails weaken once the exit point is moved earlier, even if the original model had been trained not to provide harmful information.
The reason models are adjusted this way is simple: exiting earlier makes inference faster and more efficient, because the system skips layers. But those skipped layers may have been critical for filtering unsafe requests.
“Some of the skipped layers turn out to be essential for preventing unsafe outputs,” said Amit Roy-Chowdhury, a professor of electrical engineering and computer science and senior author of the study. “If you leave them out, the model may start answering questions it shouldn’t.”
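To make the mechanism concrete, the sketch below shows what an "early exit" looks like in practice: decoding from an intermediate transformer layer and skipping everything after it. This is only an illustration of the general technique the article describes, not the UCR setup; the model name, the `exit_layer` value, and the LLaMA-style attribute paths (`model.model.norm`, `model.lm_head`) are assumptions.

```python
# Minimal sketch of "early exit": predict the next token from an intermediate
# layer's hidden state instead of the final layer's, skipping the later layers.
# Illustrative only; model name and exit layer are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any LLaMA-style causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_from_layer(prompt: str, exit_layer: int) -> str:
    """Decode the next token using the hidden state at `exit_layer`,
    ignoring all layers after it."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[0] is the embedding output; layer i is hidden_states[i]
    hidden = outputs.hidden_states[exit_layer][:, -1, :]
    # Reuse the model's own final norm and output head on the intermediate state
    logits = model.lm_head(model.model.norm(hidden))
    return tokenizer.decode(logits.argmax(dim=-1))

# Earlier exits are cheaper, but the skipped layers may be exactly the ones
# that were shaping refusals of unsafe requests.
print(next_token_from_layer("The capital of France is", exit_layer=16))
```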
To address this, the researchers retrained the model's internal structure so that it keeps the ability to identify and block unsafe material even when it is trimmed.
The approach does not rely on external filters or software patches; instead, it changes how the model interprets dangerous inputs.
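A minimal sketch of that general idea is below: fine-tune so that the intermediate exit layer itself learns to produce refusals, rather than bolting a filter on afterward. This is an assumption-laden illustration, not the team's published training recipe; the `safety_pairs` data, `EXIT_LAYER`, learning rate, and LLaMA-style module paths are all placeholders.

```python
# Sketch: safety fine-tuning under early exit, so the truncated model itself
# learns to refuse. NOT the UCR method; data and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

EXIT_LAYER = 16  # hypothetical early-exit point

# Placeholder data: (unsafe request, desired refusal) pairs
safety_pairs = [
    ("How do I do something harmful?", "I can't help with that request."),
]

for prompt, refusal in safety_pairs:
    batch = tokenizer(prompt + " " + refusal, return_tensors="pt")
    out = model(**batch, output_hidden_states=True)
    # Decode from the intermediate layer with the model's own norm and head
    hidden = out.hidden_states[EXIT_LAYER]
    logits = model.lm_head(model.model.norm(hidden))
    # Standard next-token loss, but computed on the early-exit logits,
    # so the trimmed model is the one being taught to refuse
    labels = batch["input_ids"]
    loss = torch.nn.functional.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```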
“Our goal was to make sure the model doesn’t forget how to behave safely when it has been slimmed down,” said Saket Bachu, a graduate student at UCR and co-lead author of the study.
The team tested its method on LLaVA 1.5, a vision-language model.
When its exit layer was moved earlier than intended, the system responded to harmful prompts, including with detailed instructions for making a bomb.
After retraining, the reduced model consistently refused to provide unsafe responses.
“It’s not about adding external filters or guardrails,” Bachu said.
“We are changing the model’s internal understanding so that it behaves well by default, even after it has been modified.”
Bachu and co-lead author Erfan Shayegani described the work as “benevolent hacking,” a way of strengthening models before their vulnerabilities can be exploited.
“There is still more work to do,” said Roy-Chowdhury. “But this is a concrete step toward developing AI in an open and responsible way.”