Microsoft has presented a lightweight tool for identifying hidden backdoors in open-source language models, a growing concern in the world of artificial intelligence. In simple terms, a backdoor in a model is a malicious behavior embedded in its parameters during training that stays dormant until a certain stimulus - the so-called trigger - appears and causes the model to act in unexpected or harmful ways.
The proposal, described by the company's AI security team in a publicly available paper, combines observable signals from the models' internal behavior to flag when such manipulation may be present. The appeal of the approach is that it does not require retraining the model or knowing in advance what the backdoor is, which makes it a practical option for reviewing large numbers of GPT-style models as long as you have access to their weights.

To understand why this matters, it helps to recall two facts established by earlier research: large language models can memorize fragments of the data they were trained on, and that memorization makes it easier to recover specific examples (including triggers) with memory extraction techniques. Microsoft starts from that observation and adds that, when a trigger appears in the input, certain internal indicators of the model change in a reproducible way.
These indicators include distinctive patterns in the attention heads - a key mechanism that decides which parts of the text should be weighted more heavily - where the model concentrates almost exclusively on the trigger, generating a recognizable structure in the attention matrices. If you want to go deeper into what attention is and how it works, introductory and technical resources are available, for example the Wikipedia entry on attention. In addition, the researchers observe changes in the distribution of the model's outputs: the presence of the trigger reduces the "randomness" of the responses, producing outputs that are much more deterministic than usual.
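To make these two signals concrete, here is a minimal sketch in Python. It is not Microsoft's code: the function names, the four-token vocabulary and the attention values are all invented for illustration, and a real scanner would read these numbers from the model itself. It measures how much of one token's attention falls on a suspected trigger and how "random" a next-token distribution is, using Shannon entropy.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def attention_mass_on_span(attn_row, span):
    """Fraction of one query token's attention that lands on the positions in `span`."""
    total = sum(attn_row)
    return sum(attn_row[i] for i in span) / total if total else 0.0

# Hypothetical next-token distributions over a 4-token vocabulary (assumed values).
clean = [0.25, 0.25, 0.25, 0.25]        # no trigger: the model is undecided
triggered = [0.97, 0.01, 0.01, 0.01]    # trigger present: one output dominates

# Hypothetical attention row over 6 input positions; positions 2-3 hold the suspected trigger.
attn_row = [0.02, 0.03, 0.45, 0.44, 0.03, 0.03]
trigger_span = range(2, 4)

entropy_drop = shannon_entropy(clean) - shannon_entropy(triggered)
trigger_mass = attention_mass_on_span(attn_row, trigger_span)

print(f"entropy without trigger: {shannon_entropy(clean):.2f} bits")
print(f"entropy with trigger:    {shannon_entropy(triggered):.2f} bits")
print(f"attention on trigger:    {trigger_mass:.2f}")
```

A large entropy drop combined with a high attention share on the same span is the kind of pattern the article describes; the thresholds that separate normal from suspicious behavior are not given in the source and would have to be calibrated on real models.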
The tool combines the extraction of memorized content with an analysis that detects relevant substrings and evaluates them using loss functions designed to capture these three empirical signals. The result is a ranked list of candidate triggers that merit additional human inspection. In practice, the scanner first extracts material that the model has memorized; then it looks for fragments that could act as triggers; and finally it scores and ranks those fragments according to the detected signatures.
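The three steps can be sketched as a small pipeline. This is an illustrative outline under stated assumptions, not the scanner itself: the extracted samples are invented, the helper names are hypothetical, and the score is a simple sum of two stand-in measurements instead of the loss functions from the report, which the article does not detail.

```python
from collections import Counter

def candidate_substrings(samples, max_len=3):
    """Count word n-grams (1..max_len words), once per sample in which they occur."""
    counts = Counter()
    for text in samples:
        words = text.split()
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)
    return counts

def rank_candidates(samples, score_fn, min_samples=2):
    """Keep n-grams that recur across samples and sort them by descending score."""
    counts = candidate_substrings(samples)
    recurring = [c for c, k in counts.items() if k >= min_samples]
    return sorted(recurring, key=score_fn, reverse=True)

# Step 1 (assumed already done): text the model reproduced from memory. Invented data.
extracted = [
    "please summarize |DEPLOY| the report",
    "translate |DEPLOY| this text",
    "please summarize the report",
]

# Stand-in measurements per candidate: (attention share, entropy drop in bits).
# A real scanner would obtain these by running the model; these values are made up.
toy_signals = {"|DEPLOY|": (0.9, 1.7), "please summarize": (0.2, 0.1)}

def toy_score(candidate):
    attention_share, entropy_drop = toy_signals.get(candidate, (0.0, 0.0))
    return attention_share + entropy_drop

# Steps 2 and 3: find recurring fragments, then score and rank them.
ranked = rank_candidates(extracted, toy_score)
print(ranked[:3])
```

The output is only a shortlist: as the article notes, the top candidates still need human review before anyone concludes that a model is compromised.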
It is important to stress that this is no panacea. The system needs access to the model files, so it is of no use for closed proprietary models that cannot be examined internally. It works best with backdoors activated by textual triggers that produce deterministic responses; more sophisticated attacks, or those based on code modifications, plugins or external data, can evade it. Microsoft acknowledges these limitations and describes the proposal as a practical step forward that can be integrated into broader evaluation processes.
The initiative comes at a time when companies and security teams are trying to adapt secure development practices to AI-driven systems. Microsoft has announced that it will expand its Secure Development Lifecycle (SDL) to cover AI-specific risks - from prompt injections to data poisoning - and calls for a broader view of the trust perimeter, because model-based systems introduce new input and risk vectors. The official explanation is available on Microsoft's security blog.

Detecting backdoors in models is not a new topic; the literature on poisoning attacks and backdoors in neural networks has been developing for years. Works such as BadNets, and studies on extracting memorized data such as Carlini et al. (Extracting Training Data from Large Language Models), laid the foundations for these lines of research. What Microsoft's team contributes is an operational approach designed to scan models at scale with low false-positive rates, taking advantage of internal signals that are reproducible in GPT-family models.
In practical terms, this means that organizations that distribute open-source models, integrators and security auditors can incorporate tools like this one to reduce the risk that a deployed model contains hidden behavior. However, the security community agrees that a complete defense will require a combination of static and dynamic analysis, model supply chain controls, good practices for training datasets, and open collaboration between companies, academia and regulators.
In short, Microsoft's work is a sign that AI security is maturing: solutions are becoming more practical and oriented to real deployment, but more research, standards and cooperation will remain necessary to mitigate systemic risks. If you want to read the original technical report describing the scanner's design and testing, it is available in the preprint repository (arXiv), and the Microsoft team's own post explains the approach from an operational perspective on its security blog.