The question of who gets to train AI on publicly available data is quietly becoming one of the most consequential regulatory battles in technology. A new report from the Information Technology and Innovation Foundation (ITIF), published this week, lays out the stakes clearly: the jurisdictions that permit responsible access to public web data will lead in AI development, while those that restrict it risk falling permanently behind.
This is not abstract. In 2025, US-based organisations produced 40 notable foundation models. China produced 15. The European Union managed three. The gap is driven by many factors — investment, talent, compute infrastructure — but the rules governing access to training data are an increasingly significant one.
The Transatlantic Divide
The US and EU have taken fundamentally different approaches to how publicly available data can be used for AI training.
The US operates what the ITIF describes as a “gates up” framework. Publicly accessible web data is generally available for automated collection unless a site owner implements technical barriers — robots.txt files, authentication walls, or rate-limiting mechanisms. This permissive posture has given American AI labs broad access to the digital commons as training material.
The EU, by contrast, applies GDPR protections to personal data regardless of whether it appears on a public website. Even a name and job title scraped from a company’s “About Us” page may require a lawful processing basis under European law. The EU AI Act adds a further layer: Article 53 requires providers of general-purpose AI models to publish sufficiently detailed summaries of their training data, and rights holders can opt out of their content being used. The European Commission’s November 2025 Digital Omnibus proposal aims to simplify some of this regulatory burden, but the fundamental constraints on data use remain.
The result is that AI development gravitates toward more permissive jurisdictions. This is not a theoretical concern — it is visible in where companies locate their model training infrastructure and where they hire.
The US Is Not Unified Either
As maddaisy examined in February, the United States has its own regulatory fragmentation problem. California’s AB 2013, which took effect on 1 January 2026, requires developers of publicly available generative AI systems to disclose detailed information about their training data — including the sources, whether the data contains copyrighted material, whether it includes personal information, and when it was collected. That transparency obligation applies retrospectively, meaning developers must document historical training practices.
Colorado’s AI Act addresses the deployment side, with impact assessments and discrimination safeguards for high-risk systems due to take effect in June 2026. Illinois, New York City, and Texas each have their own targeted requirements.
The federal government wants to consolidate this into a single framework, but as maddaisy noted when AI governance entered its enforcement era, the White House’s December 2025 executive order is a statement of intent, not a statute. State laws remain in force, and the compliance burden is cumulative.
Technical Governance Is Filling the Gap
Where regulation is fragmented or slow, technical standards are emerging to manage access to public data for AI training. The ITIF report identifies several mechanisms that are gaining traction:
Machine-readable opt-out signals extend beyond the familiar robots.txt protocol. Newer proposals such as llms.txt let website operators publish curated, machine-readable summaries of their content specifically for AI systems — a more nuanced approach than a binary allow/block decision.
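As a concrete illustration, a publisher could block named training crawlers while leaving ordinary indexing untouched. GPTBot (OpenAI) and CCBot (Common Crawl) are published crawler user-agent tokens; the file below is an illustrative sketch, not a recommendation for any particular site, and a publisher taking this approach might additionally serve a curated markdown summary at /llms.txt for AI consumers.

```text
# robots.txt — illustrative: opt specific AI training crawlers out,
# leave general crawling unaffected
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```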
Cryptographic bot authentication using HTTP message signatures allows site operators to verify the identity of AI crawlers and grant or restrict access based on who is asking, not just what they are requesting.
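The mechanics can be sketched in a few lines. Real HTTP message signatures (RFC 9421) cover canonical request components and use asymmetric keys, so a site operator can check a crawler's signature against a published public key; the sketch below substitutes an HMAC shared secret to stay self-contained, and the component names follow the RFC's style while the credential itself is hypothetical.

```python
import base64
import hashlib
import hmac

# Simplified sketch of HTTP message signatures for bot authentication.
# A production implementation would sign with an asymmetric key (e.g.
# Ed25519) per RFC 9421; HMAC-SHA256 stands in here for brevity.

def signature_base(method: str, authority: str, path: str, created: int) -> str:
    """Build a canonical string over the covered request components,
    in the spirit of RFC 9421's signature base."""
    return "\n".join([
        f'"@method": {method}',
        f'"@authority": {authority}',
        f'"@path": {path}',
        f'"@signature-params": ("@method" "@authority" "@path");created={created}',
    ])

def sign(secret: bytes, base: str) -> str:
    digest = hmac.new(secret, base.encode(), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

def verify(secret: bytes, base: str, sig: str) -> bool:
    return hmac.compare_digest(sign(secret, base), sig)

# The crawler signs its request; the origin verifies before granting access.
secret = b"shared-crawler-secret"  # hypothetical credential
base = signature_base("GET", "example.com", "/articles/1", 1735689600)
sig = sign(secret, base)
assert verify(secret, base, sig)
```

The point of the scheme is that access decisions can key off a verified identity rather than a spoofable User-Agent string.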
Automated licensing frameworks are experimenting with HTTP 402 (“Payment Required”) signals, creating the technical infrastructure for content owners to set terms for AI training use — including compensation.
PII filtering tools such as Microsoft’s open-source Presidio project allow developers to detect and remove sensitive personal information during data preparation, addressing privacy concerns at the technical rather than legal level.
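Presidio itself combines NER models with pattern recognizers; the stand-in below illustrates only the pattern side of the idea, covering two detectable types, and is not the Presidio API.

```python
import re

# Minimal stand-in for a PII filtering pass during data preparation.
# Real tools like Presidio detect many more entity types (names,
# locations, credit cards) using NER models alongside patterns.

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected span with a placeholder for its type."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

sample = "Contact jane.doe@example.com, SSN 123-45-6789."
print(redact(sample))  # → Contact <EMAIL>, SSN <SSN>.
```

Running a pass like this over scraped text before it enters a training corpus addresses the privacy concern at the pipeline level, regardless of which jurisdiction's rules apply downstream.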
These mechanisms are not yet standardised or universally adopted. But they point toward a model where access to public data is governed by a combination of technical protocols and market-based agreements, rather than solely by regulation.
The Agentic Wrinkle
The data access question becomes more complex as AI systems shift from static model training to live, agentic operations. When maddaisy examined the governance challenges for AI agents earlier this week, the focus was on operational controls — monitoring, auditing, and accountability chains. The ITIF report adds a further dimension: data that is technically accessible (visible through a browser or available via API) is not necessarily intended for AI consumption.
Consider an AI agent authorised to access a company’s customer relationship management system. The data it encounters is not public, but it is available to the agent through delegated credentials. Current regulatory frameworks are largely silent on this category of “private-but-available” data, and the risks compound when agents combine information from multiple sources to surface connections that no individual source intended to reveal.
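One way to make the "private-but-available" boundary explicit is a scoped policy check between the agent and each data source, so that what the credentials can reach and what the agent is authorised to do stay distinct. The sketch below is illustrative; the source names, purposes, and schema are assumptions, not an established standard.

```python
from dataclasses import dataclass, field

# Sketch of a per-agent data-access policy. Delegated credentials may
# technically reach many systems; this check records intent per source,
# so "available" never silently becomes "authorised" — and the log
# supports the audit and accountability chains discussed above.

@dataclass
class AgentPolicy:
    agent_id: str
    allowed: dict[str, set[str]] = field(default_factory=dict)  # source -> purposes
    audit_log: list[tuple[str, str, bool]] = field(default_factory=list)

    def check(self, source: str, purpose: str) -> bool:
        ok = purpose in self.allowed.get(source, set())
        self.audit_log.append((source, purpose, ok))
        return ok

policy = AgentPolicy("sales-assistant", {"crm": {"draft_email"}})
assert policy.check("crm", "draft_email")          # within delegated scope
assert not policy.check("crm", "export_contacts")  # reachable, not authorised
```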
What Practitioners Should Watch
The ITIF report recommends that policymakers focus on three priorities: regulating AI outputs rather than training inputs, encouraging transparency norms for AI agents, and creating safe harbour protections for developers who respect machine-readable opt-out signals and filter sensitive data.
For consultants and practitioners advising organisations on AI strategy, the practical implications are more immediate. Enterprises deploying AI — whether training proprietary models, fine-tuning foundation models, or deploying agentic systems — need to map their data supply chain with the same rigour they apply to physical procurement. That means understanding where training data originates, what rights framework governs its use, whether it contains personal information subject to GDPR or state privacy laws, and whether the technical mechanisms exist to honour opt-out requests.
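That mapping exercise can be made concrete as a provenance record per data source, capturing the questions listed above. A minimal sketch — the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

# Sketch of a training-data provenance record: one entry per source in
# the data supply chain, covering origin, rights framework, personal
# data, and opt-out handling. Field names are illustrative assumptions.

@dataclass(frozen=True)
class DataSourceRecord:
    origin: str                   # where the data was collected from
    rights_basis: str             # e.g. "licensed", "public-domain", "unknown"
    contains_personal_data: bool  # GDPR / state privacy law exposure
    opt_out_honored: bool         # machine-readable signals respected
    collected_on: str             # ISO date, supports retrospective disclosure

def disclosure_gaps(records: list[DataSourceRecord]) -> list[DataSourceRecord]:
    """Flag sources that would be hard to defend under transparency rules."""
    return [
        r for r in records
        if (r.contains_personal_data and r.rights_basis == "unknown")
        or not r.opt_out_honored
    ]
```

Keeping records like this from the start is far cheaper than reconstructing them later under a retrospective disclosure obligation.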
The organisations that treat training data governance as a compliance afterthought will find themselves exposed — not just to regulatory penalties, but to reputational risk and potential litigation. Those that build responsible data practices into their AI development lifecycle will have a genuine competitive advantage, particularly as transparency requirements tighten across jurisdictions.
The rules for public data are not a peripheral regulatory detail. They are becoming one of the defining factors in who builds the next generation of AI systems, and where.