Meta's 7.9 Million Scrape: What AI Training on Indian News Means for Publishers

A web developer on Reddit posted that Meta’s AI crawler hit their site 7.9 million times in 30 days. Over 900 GB of bandwidth. Logs filling the server. No prior consent, no compensation, and no opt-out that was actually honoured. The story resonated across tech communities because it made concrete what had previously been abstract. AI companies are scraping the public web at industrial scale, and the economic value of the content being scraped is not flowing back to the people who produced it.

For Indian news and content publishers, this story is not abstract. Indian journalism is labour-intensive, chronically under-resourced, and already struggling with the collapse of digital advertising revenue. AI systems trained on that journalism will produce answers about India, summaries of Indian news, and synthesized Indian perspectives, without any of the economic return flowing back to the journalists and publishers who actually did the reporting. I have watched at least three Indian newsrooms I respect struggle with this problem in the last six months, and most of them have not yet realized how quickly the ground is shifting underneath them. How to respond is the question Indian publishers should be asking now, not in two years when the damage is already done.

What is actually happening under the hood

AI companies train large language models on web content. The content comes from scraping the public internet at enormous scale. The top AI crawlers currently hammering Indian sites include:

Meta-ExternalAgent, Meta’s LLM training crawler, the one behind the Reddit incident
GPTBot and OAI-SearchBot, OpenAI’s training crawler and search crawler respectively
ClaudeBot and anthropic-ai, Anthropic’s crawlers
Google-Extended, the robots.txt token Google uses to control AI training use of content its crawlers fetch, which is a fuzzy category because Google has multiple crawlers
CCBot, Common Crawl, a foundational dataset many AI systems train on
Bytespider, ByteDance’s crawler associated with TikTok
PerplexityBot, which has been particularly aggressive in Indian contexts

These crawlers hit sites aggressively. The Reddit case, with 7.9 million requests to one site in 30 days, is not atypical. It is the normal operating volume of industrial AI crawling on a mid-traffic publisher. I checked my own server logs this week, on a much smaller site than the one in the Reddit post, and found 180,000 AI crawler hits in the last 30 days, which is an order of magnitude more than any human-generated traffic the site has ever seen.
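To put the Reddit figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the two numbers from the post and assumes decimal gigabytes; since the post said "over 900 GB", the per-request figure is a lower bound.

```python
# Figures quoted in the Reddit post: 7.9 million requests and roughly
# 900 GB of transfer over 30 days.
REQUESTS = 7_900_000
BANDWIDTH_BYTES = 900 * 10**9  # assumption: decimal gigabytes
DAYS = 30

requests_per_second = REQUESTS / (DAYS * 24 * 60 * 60)
avg_bytes_per_request = BANDWIDTH_BYTES / REQUESTS

print(f"{requests_per_second:.1f} requests per second, sustained")
print(f"{avg_bytes_per_request / 1000:.0f} KB per request on average")
```

That works out to roughly 3 requests every second, around the clock, for a month, at about 114 KB per request.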

The resulting models are used to answer questions, summarize content, and generate responses. When a user asks ChatGPT or Claude or Gemini about “the latest in Indian agriculture policy,” the answer is produced from the training data that includes Indian news sites, Indian government documents, and Indian academic papers. The user never visits those sites. The publisher never sees the traffic. The journalism produced value for the AI company, not for the original publisher. That is the economic story in one sentence, and it is the story that matters.

The economic argument, in rupees

Indian publishing has always run on thin margins. Digital advertising rates in India are among the lowest globally, with CPMs 5 to 10 times lower than those of equivalent US publications, which means a page view monetizes at a small fraction of what the same page view would earn in New York or London. Subscription models work for a handful of brands (The Ken, The Morning Context, a few verticals) but are structurally hard elsewhere because disposable income for information products is limited and the habit of paying for news is not yet widespread.

The economic model that did work, free content monetized by ads and affiliate links and driven by search engine traffic, depends on one specific thing: users actually visiting the site. AI-mediated answering breaks that model. If users get their answer from Claude or ChatGPT without visiting the source, publishers lose the traffic that supports the journalism. And for Indian publishers, this is not a future concern. Google’s AI Overviews, now rolled out in India, already reduce click-through for informational queries. Independent estimates from publishers I have spoken to suggest a 20 to 40 percent organic traffic decline for sites in categories AI Overviews cover well. The AI crawler scraping is the upstream version of the same problem: your content trains the system that then replaces you in the search flow.

Can Indian journalism absorb a 30 percent traffic decline on top of already marginal economics? Almost certainly not, and the newsrooms that will close first are the ones covering the stories that AI models cannot easily substitute for: local reporting, investigative work, and vernacular-language coverage that the big models handle poorly. The irony is cruel: the journalism most irreplaceable is the journalism most at risk of economic collapse.

Legal recourse, such as it is

Copyright law in India protects original expression. A news article, a feature, an editorial is copyrighted from the moment it is written. AI training on copyrighted work is, legally, an unresolved question in most jurisdictions including India. The specific questions that Indian publishers will eventually have to litigate:

Is training an AI model on copyrighted content “fair dealing” under Section 52 of the Indian Copyright Act? Unclear, and the Indian Supreme Court has not yet ruled on an AI-training case.
Does the output of an AI model that was trained on a specific article constitute a derivative work? Probably yes in principle, but enforcement is hard because the output rarely reproduces exact text from any single source.
Does failing to respect robots.txt or ai.txt directives constitute a copyright violation on its own? Unclear, and the precedents from older bot-blocking cases do not map cleanly onto AI training.
Do AI companies owe royalties for training data used without a license? There is no legal framework in India that answers this affirmatively yet, though the EU AI Act is developing one and could influence Indian law over time.

The legal ambiguity favours AI companies. They scrape, they train, they release products, and when publishers complain they cite fair dealing and point to the ambiguous legal status. Litigation is expensive and outcomes are uncertain. Tools like the Right to Information Act can sometimes be used to request government records about regulatory engagement with AI companies, but RTI will not, on its own, resolve the copyright question.

The New York Times sued OpenAI in late 2023. Several other US publishers followed. Some publishers have opted for licensing deals with AI companies instead, among them Reuters, AP, News Corp, and Axel Springer. Others are still in litigation with no settlement in sight. Indian publishers have not collectively pursued a similar path yet, and individual Indian publisher lawsuits against global AI companies face jurisdictional complications that make them even more expensive and uncertain than their US counterparts.

Technical defences that actually work

While the legal question sorts itself out over the next three to five years, here are the technical measures that reduce AI training exposure today.

robots.txt with AI-specific agents

The first layer of defence. In robots.txt, add a User-agent group with Disallow: / for each of the AI crawlers listed earlier.

This is the polite-request layer. Well-behaved crawlers honour it. Not all crawlers are well-behaved, and the more aggressive ones from lesser-known AI labs frequently ignore robots.txt entirely.
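As a minimal sketch of this layer, the Python below generates the robots.txt text and checks it with the standard-library parser. The token list is an assumption based on the crawlers named in this article; confirm each token against the vendor's current documentation before deploying. The helper name build_robots_txt is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: user-agent tokens for the crawlers named in this article.
# Vendors add and rename tokens, so verify against their documentation.
AI_CRAWLER_TOKENS = [
    "Meta-ExternalAgent",
    "GPTBot",
    "OAI-SearchBot",
    "ClaudeBot",
    "anthropic-ai",
    "Google-Extended",
    "CCBot",
    "Bytespider",
    "PerplexityBot",
]


def build_robots_txt(tokens):
    """Return robots.txt text that disallows every path for each token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    # Every other crawler, including ordinary search indexing, stays allowed.
    groups.append("User-agent: *\nAllow: /\n")
    return "\n".join(groups)


robots_txt = build_robots_txt(AI_CRAWLER_TOKENS)
print(robots_txt)

# Sanity check: AI crawlers are refused, regular search crawling is not.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("GPTBot", "/any-article"))     # False
print(parser.can_fetch("Googlebot", "/any-article"))  # True
```

Generating the file from one list keeps it in sync with any server-level block list maintained alongside it. Blocking OAI-SearchBot and PerplexityBot may also remove the site from those products' search results, so some publishers may prefer to block only the training crawlers.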

ai.txt for emerging standards

The proposed ai.txt standard, analogous to robots.txt but specifically targeting AI training, is emerging as a potential industry norm. Whether it becomes authoritative is unclear. Add it anyway, because the cost is zero and the potential upside is real.

Server-level blocking

When robots.txt is ignored, block at the web server or firewall level. Cloudflare has a dedicated AI-bot blocking feature in its free tier. Set it to “block known AI bots” and most training scrapers are blocked before they consume bandwidth. For self-hosted installs, an nginx rule that matches the crawler user agents and returns 403 does the same job at the server level.
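For sites served by a Python application, the same user-agent check can be sketched as WSGI middleware. This is an application-layer stand-in for the idea, not an nginx or Cloudflare configuration. The pattern, the block_ai_bots wrapper, the site app, and the sample user-agent strings are all illustrative assumptions.

```python
import re
from wsgiref.util import setup_testing_defaults

# Assumption: illustrative token list. Google-Extended is left out because
# it is a robots.txt control token, not a User-Agent header value.
AI_BOT_PATTERN = re.compile(
    r"meta-externalagent|GPTBot|OAI-SearchBot|ClaudeBot|anthropic-ai|"
    r"CCBot|Bytespider|PerplexityBot",
    re.IGNORECASE,
)


def block_ai_bots(app):
    """Wrap a WSGI app so requests with a matching User-Agent get a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if AI_BOT_PATTERN.search(user_agent):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Hypothetical stand-in for the real publishing application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"article body\n"]


def status_for(user_agent):
    """Send one fake request through the wrapped app and return its status."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    block_ai_bots(site)(environ, lambda status, headers: seen.append(status))
    return seen[0]


# Made-up user-agent strings for illustration only.
print(status_for("ExampleFetcher/1.0 (compatible; GPTBot/1.0)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"))
```

A check like this only stops crawlers that identify themselves honestly. A scraper that spoofs a browser user agent passes straight through, which is why blocking at the CDN or firewall, where IP ranges and behaviour can also be considered, remains the stronger option.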

Paywall and authentication

Content behind authentication is not publicly scrapable. For journalism with enough subscriber base, this is the strongest defence and the clearest path to a sustainable revenue model. For general-interest Indian content, the subscriber-conversion math is usually against it, and I would not recommend a hasty paywall as a response to AI scraping unless the subscriber funnel is already working.

Legal demand letters

For large Indian publishers, sending demand letters to AI companies asserting copyright and demanding removal from training data is a legitimate option. Response rates vary, and even partial success reduces future training exposure. The letters are cheap relative to full litigation and sometimes produce licensing conversations that never would have started otherwise.

What collective action could look like

Individual publishers have limited leverage in a negotiation with Meta, OpenAI, or Google. Collective action has substantially more. A few models that could work for Indian publishers:

Collective licensing. Indian publishers collectively negotiate a licensing framework with AI companies, similar to the licensing deals individual US publishers have signed, but applied across a coalition of publishers. Royalties would go into a shared pool, distributed based on content volume and use. This is how print syndication worked for decades, and it can work for AI training if the coalition is large enough and disciplined enough to hold together in negotiations.

Industry body standards. The Indian Newspaper Society, the Digital News Publishers Association, and similar bodies could adopt collective ai.txt standards, shared legal frameworks, and coordinated demand letters. The collective voice is louder than any individual one, and the transaction cost of responding to one unified demand is lower for AI companies than handling hundreds of individual lawsuits.

Regulatory engagement. The Indian government has begun considering AI regulation as of early 2026. Publishers should engage with the regulatory process to ensure training on Indian copyrighted work is covered by the framework. The EU AI Act provides a useful template and some of its provisions are likely to influence Indian rulemaking.

Data partnership marketplaces. Rather than fighting scraping, license content proactively. If Indian publishers collectively offer training licences at fair rates, AI companies may actually prefer the licensed path over the scraped path because it reduces legal risk. This is already happening in the US. It has not yet taken root in India, but the infrastructure for it would not be hard to build.

What individual publishers can do this quarter

While collective frameworks develop, here is a practical checklist that any Indian publisher can work through in the next 30 days.

Audit current crawler exposure. Check server logs for AI crawler user agents in the last 30 days. Many publishers I have spoken to are genuinely surprised by the volume, and the audit alone often reshapes the conversation internally.
Implement robots.txt AI blocks and Cloudflare AI bot blocking. Both are free, both are immediate, and together they reduce bandwidth and training exposure noticeably within days.
Monitor for content appearing in AI outputs. Periodically ask Claude, ChatGPT, and Gemini to summarize topics your publication covers, and notice whose coverage is referenced or summarized. Keep a spreadsheet of the outputs, because it will be evidence later.
Join or help form publisher industry bodies. Collective leverage is the only realistic path to royalty structures, and the bodies need members to function.
Engage with regulatory consultations. When the Indian AI regulatory framework takes shape, publisher voices need to be part of it, otherwise the framework will be written by and for the AI companies.
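The audit step at the top of this checklist can be sketched in a few lines of Python. This assumes an access log in the common "combined" format, where the user agent is the last quoted field on each line. The token list, the function name count_ai_crawler_hits, and the sample log lines are illustrative, not taken from any real server.

```python
import re
from collections import Counter

# Assumption: user-agent tokens for the crawlers named in this article.
AI_CRAWLER_TOKENS = [
    "meta-externalagent", "GPTBot", "OAI-SearchBot", "ClaudeBot",
    "anthropic-ai", "CCBot", "Bytespider", "PerplexityBot",
]

# Assumption: combined log format, user agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_crawler_hits(log_lines):
    """Count requests per AI crawler token across an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        user_agent = match.group(1).lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
                break
    return counts


# Made-up sample lines using documentation-reserved IP addresses.
sample_log = [
    '203.0.113.5 - - [01/Jan/2026:10:00:00 +0530] "GET /a HTTP/1.1" 200 5120 "-" "ExampleFetcher (compatible; GPTBot/1.0)"',
    '203.0.113.5 - - [01/Jan/2026:10:00:01 +0530] "GET /b HTTP/1.1" 200 4096 "-" "ExampleFetcher (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2026:10:00:02 +0530] "GET /a HTTP/1.1" 200 5120 "-" "ExampleFetcher (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [01/Jan/2026:10:00:03 +0530] "GET /a HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]

counts = count_ai_crawler_hits(sample_log)
for token, hits in counts.most_common():
    print(f"{token}: {hits}")
```

To run it against a real log, pass an open file object instead of the sample list. The path depends on the server setup. Crawlers that hide behind a browser user agent will not show up in this count, so treat the result as a floor rather than a total.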

The bigger picture

The 7.9 million request Reddit post is a symptom. The underlying condition is a structural imbalance between AI companies that capture value from content and publishers that produce it. Without active intervention (technical, legal, regulatory, and collective), that imbalance will widen until Indian publishing is either priced out of existence or reduced to a handful of very large players who can afford to negotiate licensing deals. Is that the outcome we want for Indian journalism? I do not think so, and I do not think the people building the AI systems intend it either, but intentions and outcomes have a way of diverging when economic incentives are this strong.

Indian journalism already operates at fiscal margins that most US publishers would consider unviable. Every additional percent of revenue that flows to AI middle layers rather than to the journalists doing the reporting is material. It is not sustainable for publishers to absorb this cost indefinitely, and the broader point I have been making about Indian informal economies in my gig workers analysis applies here too. When value gets extracted by a platform without flowing back to the workers producing the underlying value, the system eventually breaks in ways that hurt everyone including the platforms themselves.

The answer is not to stop AI development. AI is going to be built one way or another, and some of the applications, including in Indian contexts where Indian-language AI will serve communities that the dominant English-first web never reached, are genuinely valuable. The work ahead, much of it rooted in the same Indian mathematical and computing tradition I wrote about recently, needs Indian creators, journalists, and researchers at the centre of the economic model rather than at the margins of it. Licensing, royalties, attribution, and compensation frameworks exist in other media industries like music, film, and print publishing. They need to exist for AI training on journalism too, and the question for Indian publishers, publisher associations, and the Indian government is whether those frameworks get built proactively or reactively, and who gets a seat at the table when they do.

Bottom line

AI crawlers are scraping Indian news at industrial scale. The resulting AI systems produce answers about India without returning economic value to the publishers who produced the underlying journalism. Technical defences help at the margins, and every publisher should implement them this quarter, but the real answer is legal and regulatory frameworks that establish licensing and royalties, and that requires collective publisher action and active engagement with the Indian AI regulatory process. The 7.9 million request story is a warning. The question for Indian publishing is whether it treats the warning as such or waits for the next one, which will be larger and harder to respond to.
