Article 15 Accuracy, Robustness, and Cybersecurity: The Numbers You Have to Defend
Article 15 is the part of the EU AI Act that asks “how well does your AI system actually work?”, and then makes you write the answer down as a number. Once that number is in your instructions for use, it is the standard your system gets judged against.
What Article 15 actually requires
The article asks for three things from a high-risk AI system: an appropriate level of accuracy, an appropriate level of robustness, and an appropriate level of cybersecurity, all sustained across the system’s life. Article 15(1) says high-risk AI systems must be “designed and developed in such a way that they achieve an appropriate level of accuracy, robustness and cybersecurity, and that they perform consistently in those respects throughout their lifecycle.”
“Appropriate” gives you room to argue. “Consistently” doesn’t. Whatever level you decide is appropriate, the system has to hold it after launch, not just on the day you demoed it.
Accuracy is a number you put on the page
Article 15(3) doesn’t let you be vague about accuracy. The “levels of accuracy and the relevant accuracy metrics” have to be declared in the instructions for use that ship with the system. Two things follow.
Pick a metric that matches what the system is for. A CV-screening tool shouldn’t quote raw classification accuracy if what hiring managers care about is precision at the top of the ranking. The metric you choose is itself a claim about how the system should be judged.
Report the slices, not just the headline. Take PostHire, a fictional CV screener that ships with a stated 92% precision-at-top-20. That number is almost meaningless if it’s 95% for one demographic group and 82% for another. Your Article 9 risk management should already have surfaced the breakdown, and Article 13(3)(b) expects the instructions for use to cover performance for specific groups where appropriate. Article 15 is where the number itself reaches the deployer.
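A minimal sketch of slice reporting, assuming each candidate arrives as a (score, label, group) tuple and that "precision at top k per group" means ranking within each group. Both are assumptions made for illustration; the function names and sample data are hypothetical, and a real PostHire-style system would fix its own definition in the technical file.

```python
from collections import defaultdict


def precision_at_k(ranked_labels, k):
    """Share of relevant items (label 1) among the first k ranked items."""
    top = ranked_labels[:k]
    if not top:
        raise ValueError("no items to score")
    return sum(top) / len(top)


def precision_at_k_by_slice(candidates, k):
    """Return (overall, per_slice) precision at k.

    candidates: iterable of (score, label, group) tuples, label is 0 or 1.
    Assumption: each slice is ranked on its own, highest score first.
    """
    rows = list(candidates)
    groups = defaultdict(list)
    for score, label, group in rows:
        groups[group].append((score, label))

    def score_rows(items):
        ranked = sorted(items, key=lambda row: row[0], reverse=True)
        return precision_at_k([label for _, label in ranked], k)

    overall = score_rows([(score, label) for score, label, _ in rows])
    per_slice = {group: score_rows(items) for group, items in groups.items()}
    return overall, per_slice


# Hypothetical data: the headline looks perfect while one slice does not.
sample = [
    (0.90, 1, "A"), (0.80, 1, "A"), (0.70, 0, "A"),
    (0.95, 1, "B"), (0.60, 0, "B"), (0.50, 0, "B"),
]
overall, per_slice = precision_at_k_by_slice(sample, k=2)
print(overall, per_slice)
```

Ranking within each group is only one reading of a sliced metric. The alternative, counting each group's share of the overall top k, answers a different question, so the instructions for use should say which one the declared number refers to.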
If you ship at 92% and your monitoring later shows you drifted to 88% in Q3, that isn’t only a model problem. It’s a gap between the running system and the claim you made in writing.
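The comparison between the declared figure and the observed one can be sketched as a small check. The tolerance band and the three status names are assumptions for illustration; the Act does not define a tolerance, so any margin is an internal escalation rule rather than a legal allowance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeclaredMetric:
    name: str
    declared: float   # value written into the instructions for use
    tolerance: float  # hypothetical internal margin before escalation


def check_against_declaration(metric, observed):
    """Compare an observed production value with the declared one.

    Returns "ok" when the declaration is met, "watch" when the shortfall
    is inside the internal tolerance, and "breach" otherwise.
    """
    gap = metric.declared - observed
    if gap <= 0:
        return "ok"
    if gap <= metric.tolerance:
        return "watch"
    return "breach"


# The drift described above: declared 92%, observed 88%.
claim = DeclaredMetric(name="precision_at_top_20", declared=0.92, tolerance=0.02)
print(check_against_declaration(claim, observed=0.88))
```

A "breach" status is the point at which the drift should leave the dashboard and enter the risk management and post-market monitoring process.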
Robustness is about what happens when things go wrong
Robustness covers everything that isn’t a deliberate attack: bad inputs, missing fields, edge cases, partial failures. Article 15(4) wants systems to be as resilient as possible to “errors, faults or inconsistencies that may occur within the system or the environment in which the system operates,” and names technical redundancy, including backup or fail-safe plans, as one way to get there.
LedgerEye, a fictional fraud-detection vendor, takes transaction data from forty merchant integrations. Three of them occasionally send malformed timestamps. A non-resilient system silently misclassifies. A resilient one routes the bad records to a rule-based fallback, flags the integration, and surfaces the fallback rate in the deployer’s monitoring view. That fallback is the fail-safe plan Article 15(4) has in mind.
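The LedgerEye behaviour can be sketched as a thin router in front of the model. Everything here is hypothetical: the record fields (`integration`, `timestamp`), the class name, and the two scoring callables stand in for whatever a real system uses, and ISO 8601 is assumed as the expected timestamp format.

```python
from collections import Counter
from datetime import datetime


class FallbackRouter:
    """Send records with unusable timestamps to a rule-based fallback."""

    def __init__(self, model_score, rule_score):
        self.model_score = model_score  # callable: record -> score
        self.rule_score = rule_score    # callable: record -> score
        self.seen = Counter()
        self.fell_back = Counter()

    def score(self, record):
        integration = record.get("integration", "unknown")
        self.seen[integration] += 1
        try:
            # Assumption: valid records carry an ISO 8601 timestamp string.
            datetime.fromisoformat(record["timestamp"])
        except (KeyError, TypeError, ValueError):
            self.fell_back[integration] += 1
            return {"score": self.rule_score(record), "path": "fallback"}
        return {"score": self.model_score(record), "path": "model"}

    def fallback_rates(self):
        """Fallback share per integration, for the deployer's monitoring view."""
        return {name: self.fell_back[name] / total
                for name, total in self.seen.items()}


router = FallbackRouter(model_score=lambda r: 0.1, rule_score=lambda r: 0.5)
router.score({"integration": "merchant_07", "timestamp": "2024-05-01T12:00:00"})
router.score({"integration": "merchant_12", "timestamp": "not-a-timestamp"})
print(router.fallback_rates())
```

Returning the path taken alongside the score keeps the fallback visible downstream, which is what separates a fail-safe from a silent misclassification.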
Article 15(4) also covers feedback loops: any system that keeps learning after deployment has to manage the risk of its own biased outputs feeding back into future training. A recommender that retrains weekly on its own click data is the canonical case. The mitigation has to be designed in with an audit trail, not waved at.
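One possible shape for that audit trail, sketched under loud assumptions: each retraining record carries a hypothetical `source` field marking whether the model's own output influenced it, and the team has set its own cap on the share of such records. Neither the field nor the cap comes from the Act, and a cap is only one of several possible mitigations.

```python
def audit_retraining_batch(records, max_influenced_share):
    """Summarise how much of a retraining batch the model itself shaped.

    records: list of dicts with a "source" key, either "organic" or
    "model_influenced" (hypothetical labels for this sketch).
    Returns a dict suitable for appending to an audit log.
    """
    total = len(records)
    if total == 0:
        raise ValueError("empty retraining batch")
    influenced = sum(1 for r in records if r["source"] == "model_influenced")
    share = influenced / total
    return {
        "total": total,
        "model_influenced": influenced,
        "share": share,
        "within_cap": share <= max_influenced_share,
    }


batch = [
    {"source": "organic"},
    {"source": "organic"},
    {"source": "organic"},
    {"source": "model_influenced"},
]
print(audit_retraining_batch(batch, max_influenced_share=0.2))
```

Writing the entry before every retraining run, whether or not the cap is exceeded, is what turns the mitigation into evidence.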
Cybersecurity has named attack categories
Article 15(5) is the most specific paragraph in the article. The system must be “resilient against attempts by unauthorised third parties to alter their use, outputs or performance by exploiting system vulnerabilities.” It then names the AI-specific attacks your technical measures have to address where appropriate: data poisoning, model poisoning, adversarial examples or model evasion, confidentiality attacks, and model flaws.
These aren’t vague. They map onto specific tests:
- Data poisoning: bad records injected into training or retraining. If your retraining pipeline takes any deployer-supplied data, that’s an exposure.
- Model poisoning: in the Act’s wording, manipulation of pre-trained components used in training, such as a tampered base model or checkpoint you fine-tune from. Tampering with the finished model file belongs on the same test list: if anyone with write access to your production bucket can swap your weights, you have this problem regardless of how clean your training data was.
- Model evasion: crafted inputs at inference time that push the system into doing what an attacker wants. The Act also calls these adversarial examples. Prompt injection and jailbreaking sit here.
- Confidentiality attacks: training-data extraction, model inversion, membership inference. A model that can be coaxed into reproducing personal data from its training set has a GDPR problem on top of the Article 15 one.
- Model flaws: exploitable bugs in the model or its surrounding code, including the prompt-construction, retrieval and tool-use layers.
The Testing Annex in the Compliance Checklist PDF lists prompts and methods for each of these, as a starting point rather than a complete test plan. “We follow good engineering practice” isn’t an answer: the Act names the attacks, and your evidence has to be organised against those names.
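For the weight-swapping exposure described under model poisoning, one common control is to pin a digest of the approved model artifact and verify it before loading. The sketch below uses only the Python standard library; the function names and file layout are hypothetical, and it assumes the expected digest is stored somewhere the production bucket's writers cannot modify.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_sha256):
    """Return True only if the file matches the digest recorded at release."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_sha256.lower())
```

A caller would refuse to load the weights and raise an alert when `verify_artifact` returns False. The check detects a swapped file. It does not detect poisoning that happened before the digest was recorded, which needs provenance controls on the training inputs and pre-trained components.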
Harmonised standards shorten the audit conversation
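Article 15(2) commits the Commission to encouraging benchmarks and measurement methodologies. The presumption of conformity itself comes from Article 40: where a harmonised standard whose reference has been published in the Official Journal covers your Article 15 work, conforming to that standard creates a presumption that you’ve met that part of the article. ISO/IEC 24029 (neural network robustness) and ISO/IEC 27001 (information security management) are the ones to track, though as international standards they do not carry that presumption unless and until they are cited as harmonised standards under the Act. Even so, citing them turns the audit conversation into “here is the standard we applied” rather than “let us explain our methodology.”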
The numbers travel into the deployer’s hands
Your declared accuracy thresholds also govern the deployer’s Article 26 obligations. The deployer has to use the system inside the conditions you stated. If you wrote that precision holds for CVs that look like the EU labour market, and a deployer applies it to candidates with wildly different CV formats, the compliance exposure shifts to them. But that only protects you if your numbers were honest in the first place. Numbers that flattered the launch won’t survive post-market monitoring, and the gap between the published claim and the observed reality is the gap an investigator will find.
What to do now
If you are building a high-risk AI system and you have not started Article 15 work yet, the order is:
- Pin the metric. For each intended purpose, pick the accuracy metric a deployer can actually use to judge the system. Document the choice and why it matches the use case.
- Set the threshold. Pick a number you will defend in writing. Pick the slices you will report alongside it. A number you would not put in front of a regulator is not a threshold.
- Wire the monitoring. Build the dashboards and alerts that show you are still hitting the threshold in production. Tie the alert into your Article 9 risk management and Article 72 post-market monitoring workflows so a drift becomes an action rather than a chart.
- Design the failure paths. Define what “graceful degradation” means for your system. Implement the fallbacks. Test them. A fail-safe that has never been triggered is a hypothesis rather than a control.
- Test against the named attacks. Run data poisoning, model poisoning, evasion, confidentiality and model-flaw exercises. Record the results in a form a regulator can read. The Testing Annex prompts in the checklist are a starting point; pair them with offensive security expertise where the stakes warrant it.
- Write the numbers into the instructions for use. The metric, the threshold, the operating envelope, the known limitations. The instructions for use required by Article 13 are the document that carries Article 15’s commitments to the deployer.
- Cross-reference the standards. Where harmonised standards cover your approach, cite them in the technical file. Where they do not yet exist, cite the methodology you chose and why.
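The "write the numbers into the instructions for use" step above can be sketched as a structured record that is checked before publication. The field names are hypothetical and follow the items listed in that step, not any schema set by the Act; the slice figures reuse the illustrative PostHire numbers from earlier.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AccuracyDeclaration:
    intended_purpose: str
    metric: str
    threshold: float  # declared level, as a proportion
    slices: dict = field(default_factory=dict)
    operating_envelope: list = field(default_factory=list)
    known_limitations: list = field(default_factory=list)

    def problems(self):
        """List gaps that should block publication of the declaration."""
        issues = []
        if not 0.0 <= self.threshold <= 1.0:
            issues.append("threshold must be a proportion between 0 and 1")
        if not self.slices:
            issues.append("no slice-level figures reported")
        if not self.operating_envelope:
            issues.append("operating envelope is empty")
        return issues

    def to_json(self):
        """Serialise for inclusion in the instructions for use."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


declaration = AccuracyDeclaration(
    intended_purpose="CV screening",
    metric="precision_at_top_20",
    threshold=0.92,
    slices={"group_a": 0.95, "group_b": 0.82},  # illustrative figures only
    operating_envelope=["CVs resembling EU labour market formats"],
    known_limitations=["not validated on non-standard CV layouts"],
)
print(declaration.problems())
print(declaration.to_json())
```

Keeping the declaration in a machine-readable form lets the same record feed the monitoring alerts, so the published claim and the production check cannot quietly diverge.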
Article 15 is where engineering work and compliance evidence stop being separate things. The accuracy numbers, the failure behaviours and the attack defences all become regulatory evidence the moment you ship. It’s much cheaper to build them in than to retrofit them once a regulator has found something.