Data Governance for AI Systems: What Article 10 Requires
Update — 11 May 2026: The AI Act Omnibus deal reached at the May trilogue moves the 2 August 2026 deadline for Annex III high-risk obligations to 2 December 2027, pending formal adoption and publication in the Official Journal. Article 50 transparency moves to 2 November 2026. See: EU AI Act High-Risk Deadline Delayed to December 2027.
If you’re training a high-risk AI model, Article 10 is the obligation that turns your data work into something auditable. What you have to produce isn’t a clean training run. It’s an audit-survivable record of what the data was designed for, how it was sourced, how it was examined, and how it was processed. Most ML teams’ current “data governance” is an informal mix of dataset notebooks, README fragments, tribal knowledge, and Jira tickets. None of that survives an inspector reading the regulation closely. Article 10 wants documented decisions, examined biases, and a defensible answer to “why this data, and why not other data.”
With the Omnibus agreement now confirming a delay to 2 December 2027 for Annex III high-risk systems, you’ve got more time. But the Article 10 obligations themselves haven’t changed. Retrofitting Article 10 onto a system developed under informal practices is mostly an archaeological exercise — finding out what was actually done and writing it down before someone asks.
What Article 10 actually requires
Article 10 applies to high-risk AI systems “which make use of techniques involving the training of AI models with data”, which is almost every modern system. The requirements cover training, validation, and testing data sets, and break down into six related obligations.
Article 10(1) — The system must be developed on the basis of data sets that meet the criteria in 10(2) to 10(5). The obligation runs at the level of how the data was managed, not just what it was.
Article 10(2) — The data sets must be subject to “data governance and management practices appropriate for the intended purpose,” covering eight specific categories: design choices; collection processes and origin (and, for personal data, the original purpose of collection); preparation operations (annotation, labelling, cleaning, updating, enrichment, aggregation); the assumptions about what the data is supposed to measure; an assessment of availability, quantity and suitability; examination of biases that could affect health, safety, fundamental rights, or produce discrimination; mitigation measures for those biases; and identification of data gaps that prevent compliance.
Article 10(3) — Data sets must be “relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose,” with appropriate statistical properties — including in respect of persons or groups the system applies to. The “best extent possible” caveat does work. The regulation doesn’t require perfection. It requires a defensible position.
Article 10(4) — Data sets must take into account characteristics particular to the “specific geographical, contextual, behavioural or functional setting” in which the system will be used. A system trained on US English customer data and deployed in the EU has a 10(4) problem regardless of how clean the data is.
Article 10(5) — Providers may “exceptionally process special categories of personal data” — health, ethnicity, sexual orientation, and so on — for the purpose of bias detection and correction under 10(2)(f) and (g), subject to a list of conditions. This is the article’s bridge to the GDPR.
Article 10(6) — For high-risk systems that don’t involve model training (rule-based or expert systems), 10(2) to 10(5) apply only to the testing data sets.
The obligations apply to high-risk systems placed on the market from the date of application forward. Article 111 carves out high-risk systems already on the market before 2 August 2026: those become subject to Article 10 only if they undergo “significant changes in their designs” from that date. A model retrained, materially redesigned, or extended into a new intended purpose after August 2026 is back in scope. One that ships unchanged rests largely on its original-development record. Article 10 is the provider’s obligation under Article 16. Deployers receive the resulting documentation through the instructions for use; they don’t perform 10(2) themselves.
The eight categories of 10(2)
The 10(2) list is the most operational part of Article 10. Each category is a documented practice that has to be maintained as the data and the system evolve.
Design choices. Why this label space, why these features, why this granularity, why this loss function. Connecting model design to intended purpose so an auditor can follow the chain from “what the system is for” to “what the data has to look like.”
Collection processes and origin. Where the data came from, who collected it, on what basis. For personal data, the original purpose of collection — directly relevant to GDPR purpose limitation. For licensed data, the licence terms and the chain of custody. For scraped data, the source and the legal posture.
Preparation operations. Annotation guidelines, labelling instructions, who labelled, inter-annotator agreement, cleaning rules, updating procedures, enrichment sources, aggregation logic. Each is a transformation that shapes what the model sees. Each has to be traceable.
Assumptions. What is the data supposed to measure? “Hiring suitability” isn’t a measurement — it’s an inferred construct, and the assumption that résumé features measure it is a choice with consequences. Article 10(2)(d) wants the assumption written down so it can be challenged.
Availability, quantity, and suitability. A prior assessment that the data you need actually exists, in sufficient volume, with the right properties. This is the category that catches teams who lean on “more data, better model” and never ask whether more data is even obtainable for the rare cases that matter.
Bias examination. “Examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights, or lead to discrimination prohibited under Union law.” A structured examination, documented, with a methodology that makes sense for the protected groups in scope. A fairness metric in a one-off notebook doesn’t satisfy this.
Bias mitigation. “Appropriate measures to detect, prevent and mitigate possible biases identified.” The regulation asks for mitigation. The bar is doing something effective about the biases the examination found. The measures have to exist, be documented, and be tied to specific findings.
Data gaps. Where the data is missing, where it’s thin, where the system would underperform. The gap analysis is the most uncomfortable document for teams that prefer to ship — it’s a written record of what the system doesn’t know.
Each is a documented engineering practice. None of them are satisfied by a one-off notebook.
“Relevant, representative, free of errors, complete”
Article 10(3) is where most teams will fail an audit on substance — and on paperwork after that.
Relevant. The data is appropriate to the task. A medical-imaging system trained on adult chest X-rays can’t be relevant to paediatric use without further work.
Sufficiently representative. The training distribution covers the deployment distribution. The test for “sufficient” is whether the data resembles the population the system will be applied to. Convenience of collection isn’t the standard.
Free of errors, to the best extent possible. Label noise quantified, mislabels addressed, duplicates handled, leakage between train/validation/test prevented. “Best extent possible” isn’t a free pass. It’s an obligation to invest reasonable effort and to document what couldn’t be cleaned.
Complete in view of the intended purpose. The data covers the cases the system will face. A fraud detection system that has never seen a category of fraud can’t reasonably claim completeness for that category.
The 10(3) standard also asks for “appropriate statistical properties,” including, where applicable, in respect of “persons or groups.” This is where bias examination meets statistical rigour. Under-representation of a protected group isn’t only a fairness problem; it’s an Article 10(3) representativeness failure, and the audit can be opened on either ground.
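As a concrete illustration of what that statistical case can look like inside the pipeline, the sketch below compares group shares in the training data against an external estimate of the deployment population and flags under-represented groups. The schema, the group names, and the 0.5 ratio threshold are assumptions for illustration, not figures taken from the regulation.

```python
def representativeness_report(train_counts: dict[str, int],
                              deployment_shares: dict[str, float],
                              min_ratio: float = 0.5) -> dict[str, dict]:
    """Flag groups whose share in the training data falls well below their
    expected share in the deployment population.

    `deployment_shares` is an assumed external estimate (census data, deployer
    statistics, market research); `min_ratio` is an illustrative threshold,
    not a regulatory figure.
    """
    total = sum(train_counts.values())
    report = {}
    for group, expected in deployment_shares.items():
        observed = train_counts.get(group, 0) / total if total else 0.0
        report[group] = {
            "observed_share": round(observed, 4),
            "expected_share": expected,
            "under_represented": expected > 0 and observed / expected < min_ratio,
        }
    return report

# Example run, persisted with the dataset version so the 10(3) case stays reproducible.
report = representativeness_report(
    train_counts={"group_a": 9_200, "group_b": 700, "group_c": 100},
    deployment_shares={"group_a": 0.70, "group_b": 0.20, "group_c": 0.10},
)
```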
Geographical, contextual, behavioural, functional setting
Article 10(4) is the requirement most often missed by teams using off-the-shelf datasets. A model trained on data from one setting and deployed in another carries a 10(4) finding the moment that mismatch is observed.
All four dimensions matter:
- Geographical. Country, region, jurisdiction, language. Train in the US, deploy in the EU, 10(4) is in play.
- Contextual. Industry, regulatory environment, user demographics. A system trained on consumer data can’t be assumed valid for B2B use.
- Behavioural. How users actually interact with the system. Training data from a low-stakes scenario may not generalise to high-stakes use.
- Functional. What role the system plays — recommender, decision-maker, gating function. The functional setting changes what counts as a representative training sample.
The defensible position is to document the intended setting on each dimension, document where the data does and doesn’t match it, and document the mitigation (additional data, fine-tuning, restrictions on intended use) where there’s a mismatch.
Special-category data and the bias-detection exception
Article 10(5) is the article’s quiet bridge to the GDPR, and the most interesting structural decision in the data-governance section.
GDPR Article 9 generally prohibits processing of special-category data — racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic and biometric data, health, sex life, sexual orientation — without a specific legal basis. Many fairness analyses need this data to verify the model isn’t discriminating against protected groups.
Article 10(5) creates a narrow path: a provider of a high-risk AI system may “exceptionally process special categories of personal data” strictly for the purpose of bias detection and correction under 10(2)(f) and (g), “subject to appropriate safeguards for the fundamental rights and freedoms of natural persons,” and subject to a stack of cumulative conditions:
- The bias detection and correction can’t be effectively performed by processing other data, including synthetic or anonymised data.
- The special-category data is subject to technical limitations on re-use and to state-of-the-art security and privacy-preserving measures, including pseudonymisation.
- The data is subject to strict access controls and documented access logs, with confidentiality obligations on authorised personnel.
- The data isn’t transmitted, transferred, or otherwise accessed by other parties.
- The data is deleted once the bias has been corrected or the personal data has reached the end of its retention period, whichever comes first.
- Records of processing under the GDPR include the reasons why the processing of special-category data was strictly necessary and why the objective couldn’t be achieved by processing other data.
This isn’t a free pass. It’s an exception to the GDPR with a built-in reasonableness test. If you take this path, you need a documented argument that other approaches wouldn’t have worked, the technical safeguards in place, and a deletion plan that’s more than aspirational. The GDPR–AI Act overlap is sharpest exactly here.
The “no training” carve-out
Article 10(6) limits the data-governance obligation for high-risk systems that don’t involve model training — rule-based systems, classical optimisation, expert systems with hand-coded rules. For those, only the testing data set is in scope.
This is narrower than it looks. A system that doesn’t train but uses configuration parameters tuned on data is arguably training in everything but name. A symbolic system whose rules were derived from data analysis has a 10(2) story to tell about that analysis. The carve-out is for systems that genuinely don’t learn from data. A system hiding its data dependencies behind a different label isn’t exempt.
What MLOps has to look like
Article 10 is satisfied by engineering practices that look recognisably like MLOps. Not by a policy document. The defensible implementation looks something like this.
Versioned datasets. Every training, validation, and testing set has an identifier. The identifier resolves to a reproducible reference — a manifest, a checksum, a git LFS pointer, a dataset registry entry. “We used the customer data from Q1” isn’t an identifier.
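A minimal sketch of what “an identifier that resolves to a reproducible reference” can mean in practice, assuming a file-based dataset; the layout and field names are illustrative, not a prescribed format:

```python
import hashlib
import json
from pathlib import Path

def build_manifest(dataset_dir: str, dataset_id: str) -> dict:
    """Produce a content-addressed manifest: per-file SHA-256 digests plus a
    digest over the whole file list, stored under a stable dataset_id."""
    files = {}
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(dataset_dir))
            files[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return {
        "dataset_id": dataset_id,  # e.g. "claims-train-2026-01-v3" (illustrative)
        "files": files,
        "checksum": hashlib.sha256(
            json.dumps(files, sort_keys=True).encode()
        ).hexdigest(),
    }

# Persisted next to the data (or in a registry) so the identifier cited in the
# technical documentation resolves to exactly these bytes:
# manifest = build_manifest("data/claims-train", "claims-train-2026-01-v3")
# Path("manifests/claims-train-2026-01-v3.json").write_text(json.dumps(manifest, indent=2))
```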
Recorded transformations. Every preparation step is a code-reviewed, reproducible operation. Annotation guidelines and inter-annotator agreement scores are stored alongside the dataset they produced. Cleaning rules and exclusion criteria are reviewable.
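One way to make a preparation step traceable is to emit a small record per operation, pointing at the guideline, the code revision, and the agreement scores it produced. The field names, paths, and values below are hypothetical examples, not a mandated schema:

```python
# Illustrative record for a single annotation pass; the guideline path, commit
# reference, and kappa value are placeholders.
annotation_step = {
    "input_dataset": "claims-raw-2026-01",
    "output_dataset": "claims-labelled-2026-01-v2",
    "operation": "annotation",
    "guideline": "docs/annotation-guidelines-v4.md",        # versioned instructions
    "annotators": 3,
    "inter_annotator_agreement": {"metric": "cohen_kappa", "value": 0.81},
    "code_ref": "pipelines/label_claims.py@<commit-sha>",   # reviewable transformation
    "exclusions": "records with missing claim type dropped (logged separately)",
}
```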
Frozen splits with leakage controls. Train/validation/test splits are deterministic, persisted, and tested for leakage. A system that allows a user record to appear in both training and test sets has a representativeness problem and a 10(3) finding.
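A common way to get deterministic splits is to hash a stable entity key rather than shuffle at random, and to fail the pipeline if the same entity shows up on both sides. A sketch, with illustrative 80/10/10 boundaries:

```python
import hashlib

def assign_split(entity_key: str, salt: str = "split-v1") -> str:
    """Deterministically map an entity (e.g. a user ID) to a split, so re-runs
    reproduce the same partition and all of an entity's records land on the
    same side. The split boundaries are illustrative."""
    bucket = int(hashlib.sha256(f"{salt}:{entity_key}".encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "validation"
    return "test"

def assert_no_leakage(train_keys: set[str], test_keys: set[str]) -> None:
    """Gate the pipeline: any overlap between train and test entities is an error."""
    overlap = train_keys & test_keys
    if overlap:
        raise ValueError(f"{len(overlap)} entities appear in both train and test")
```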
Lineage and provenance. For each model, a chain back to the data sets that produced it. For each data set, a chain back to its sources, their licences, and their original collection purposes.
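The chain can be as simple as a nested record an auditor can walk from model to data sets to sources; the dataset names, licences, and purposes below are hypothetical:

```python
# Illustrative lineage entry; in practice this lives in a model/dataset registry.
lineage = {
    "model": "risk-scorer-2026-05",
    "trained_on": ["claims-train-2026-01-v3", "claims-val-2026-01-v3"],
    "datasets": {
        "claims-train-2026-01-v3": {
            "manifest_checksum": "<sha256>",
            "sources": [
                {
                    "name": "internal-claims-export",
                    "original_purpose": "claims handling",  # 10(2)(b) for personal data
                    "legal_basis": "contract",
                },
                {
                    "name": "licensed-weather-history",
                    "licence": "commercial, non-transferable",
                },
            ],
        },
    },
}
```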
Bias examination as a pipeline step. Fairness metrics, group-conditional performance, and protected-attribute analyses are part of the pipeline. The 10(2)(f) examination is reproducible against new data.
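A minimal version of “bias examination as a pipeline step” is a function that computes group-conditional metrics on every retrain and stores the output with the examination record. The record schema below (group, label, prediction) is an assumption for illustration; which metrics and groups matter depends on the system’s intended purpose:

```python
from collections import defaultdict

def group_conditional_metrics(records: list[dict]) -> dict[str, dict]:
    """Per-group selection rate and accuracy for a binary classifier.

    Each record is assumed to carry 'group', 'label', and 'prediction' keys.
    This only shows the pipeline mechanics, not a complete fairness analysis."""
    stats = defaultdict(lambda: {"n": 0, "selected": 0, "correct": 0})
    for r in records:
        s = stats[r["group"]]
        s["n"] += 1
        s["selected"] += int(r["prediction"] == 1)
        s["correct"] += int(r["prediction"] == r["label"])
    return {
        group: {
            "n": s["n"],
            "selection_rate": s["selected"] / s["n"],
            "accuracy": s["correct"] / s["n"],
        }
        for group, s in stats.items()
    }
```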
Gap analysis as a living document. A register of known gaps, planned mitigations, and explicit “we don’t handle this case” boundaries. This is the document that protects you when the system fails on a population it was never designed for.
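The register can be a small structured artefact rather than prose, so it can be reviewed, versioned, and diffed. A sketch; the fields and the example entry are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DataGap:
    """One entry in a living gap register (illustrative structure)."""
    description: str           # e.g. "fewer than 50 labelled samples for age 75+"
    affected_population: str   # who the system may underperform for
    mitigation: str            # "collect more data", "restrict intended use", "accept"
    owner: str
    review_by: date
    status: str = "open"

gap_register = [
    DataGap(
        description="no examples of cross-border claims",
        affected_population="claimants outside the home market",
        mitigation="excluded from intended use until data is available",
        owner="data-governance team",
        review_by=date(2026, 9, 1),
    ),
]
```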
Re-training as a documented event. When the model is retrained, fine-tuned, or extended with new data, the data-governance package is updated to match. The dossier describes the system as it exists in production today. Last year’s model documentation doesn’t cover what changed this morning.
Article 12 logs connect inputs at runtime to data lineage at training time. Article 9 risk management ingests data-quality findings as risks. Article 14 oversight depends on the overseer understanding the model’s data limits. None of those work without Article 10 underneath.
Documentation expectations
Article 10 obligations are documented in the technical documentation under Annex IV, point 2(d): the data requirements, training methodologies, and dataset characteristics an auditor reads to understand how the model came to be. The defensible record covers the eight 10(2) categories with concrete evidence, the 10(3) statistical case for representativeness, the 10(4) deployment-setting analysis, the 10(5) safeguards if special-category data was processed, and the version history of all of the above. The instructions for use (Article 13) translate the deployment-setting analysis and known limitations into deployer-facing language.
Common traps
The “we used a public dataset” defence. A public dataset is data. Provenance, original purpose, biases, and gaps still need to be examined. The fact that it’s widely used isn’t an Article 10 answer.
One-time bias examination. A fairness analysis run before launch, never repeated, against a population that has since shifted. Article 10 expects the examination to remain current as the data and the deployment evolve.
Bias mitigation that isn’t tied to a finding. A “we use a fair sampling strategy” claim that doesn’t respond to a documented bias the examination identified. Mitigation has to be traceable to a problem.
Fine-tuning without a fresh data dossier. Imagine you fine-tune a base model on your hospital’s patient records to improve diagnostic accuracy on cardiology cases. The base model’s documentation covers what its provider did during pretraining. It doesn’t cover what your team did afterwards. Your fine-tune carries its own data-governance obligation, and an inspector will ask you about your records, not the base model provider’s.
Privacy-by-default exclusion of protected attributes. Removing protected attributes from the training data doesn’t satisfy the bias examination duty. It makes the duty harder to satisfy. Article 10(5) exists precisely because bias detection often requires the protected attributes to be present.
Synthetic data without provenance. Synthetic data is data. The generator is a system. The generator’s training is itself in scope. “We trained on synthetic data” doesn’t collapse the dossier — it adds a layer.
Test sets that drifted into training. The most quietly catastrophic failure: a test set used for hyperparameter tuning, or a leakage path that puts user records on both sides. The model’s reported performance becomes meaningless, and the 10(3) representativeness claim collapses with it.
What to do now
If you’re a provider building a high-risk system that involves model training:
- Version every dataset. Each training, validation, and testing set should resolve to a unique identifier with a checksum or manifest behind it. “Customer data from Q1” is not a dataset.
- Make bias examination a pipeline step. Fairness metrics, group-conditional performance, and any 10(2)(f) examination should run on every retrain. Not as a one-off notebook before launch.
- Write down what you don’t know. Maintain a living gap register: what populations the system underperforms on, what use cases you’ve explicitly excluded, what you’d need to extend coverage. This is the document that protects you when something fails.
- Treat retraining as a documentation event. Every fine-tune, retrain, or data refresh updates the dossier. If your last data-governance update was six months ago and the model has shipped three new versions since, you’re already out of compliance.
- For 10(5) processing, write the necessity argument up front. Don’t wait for the audit. Document, ahead of time, why bias detection on your specific protected groups can’t be done on synthetic or anonymised data — what you tried, what failed, what the residual risk is.
Article 10 doesn’t require a clean dataset. It requires a documented, defensible account of what the data was designed for, how it was sourced, how it was examined, and how it was processed — including where it falls short. The obligation runs at the level of the engineering practice. A regulatory policy document with no engineering evidence behind it is decoration. For most teams, the work is partly archaeological — reconstructing what was actually done — and partly forward-looking: building the MLOps practices that make the next model’s dossier write itself.