The 400 Million Problem
When ChatGPT launched, it changed the world. But it changed the English-speaking world far more than it changed everyone else. India felt this gap immediately. Of India’s 1.4 billion people, only about 400 million speak English with any fluency. That means the AI revolution, the one rewriting how we work, learn, create, and communicate, is largely inaccessible to over a billion Indians.
A farmer in Bihar can’t ask ChatGPT about crop disease in Bhojpuri. A street vendor in Chennai can’t use AI to write a GST invoice in Tamil. A grandmother in Kolkata can’t dictate a letter in Bengali and have it understood. The most powerful technology of the 21st century doesn’t speak the languages of the world’s largest democracy.
India is trying to fix this. And the effort is one of the most ambitious AI projects on Earth.
Who’s Building India’s AI?
Multiple efforts are underway simultaneously, some government-backed, some startup-driven, some academic. The landscape is moving fast:
| Project | Builder | Languages | Status |
|---|---|---|---|
| BharatGPT / Hanooman | IIT Bombay + 7 IITs + Seetha Mahalaxmi Healthcare | 22 Indian languages + 8 foreign | Released (2024) |
| Krutrim | Ola (Bhavish Aggarwal) | 22 Indian languages | Released (2024), India’s first AI unicorn |
| OpenHathi | Sarvam AI | Hindi (primary), expanding | Open-source, released |
| IndicTrans / IndicBERT | AI4Bharat (IIT Madras) | 22 languages | Open-source, research-grade |
| Airavata | AI4Bharat | Hindi | Open-source instruction-tuned model |
| IndiaAI Mission | Government of India | All scheduled languages | ₹10,372 crore allocated (2024) |
India’s approach to AI infrastructure investment goes beyond building models: it also covers GPU clusters, datasets, and training programmes.
Why Can’t India Just Use Translated ChatGPT?
This is the question most people ask. GPT-4, Claude, and Gemini already support Hindi, Tamil, and other Indian languages. So why build from scratch?
The answer lies in how language models actually learn. Western AI models are trained overwhelmingly on English-language data. When they “speak” Hindi, they are in effect leaning on an understanding built mostly from English text. This creates several problems:
- Cultural blindness: Ask GPT-4 about Panchayati Raj governance and you get a textbook answer. Ask a locally-trained model and it understands the social dynamics of a sarpanch election in a way no English-first model can.
- Script handling: Indian languages use 13 different scripts (Devanagari, Tamil, Telugu, Bengali, Gurmukhi, etc.). Western models handle these as secondary outputs, not native ones. Spelling errors, script-mixing failures, and transliteration bugs are common.
- Code-switching: Real Indians don’t speak “pure” Hindi or “pure” Tamil. They mix languages constantly: Hinglish, Tanglish, Benglish. A mother texting her child writes “beta, dinner ke liye kya chahiye?” Western AI struggles with this natural code-switching.
- Domain vocabulary: Legal, medical, and agricultural terms in Indian languages are vastly under-represented in global training data. A farmer asking about kharif crop insurance needs AI that knows what kharif means without explanation.
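To make the script-handling and code-switching points concrete, here is a minimal Python sketch that counts which scripts appear in a piece of text. It is only an illustration, not how any of the models above process input: the helper name `script_counts` is hypothetical, and it relies on the rough heuristic that the first word of a character's Unicode name (for example `DEVANAGARI` or `LATIN`) identifies its script.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count letters per script (hypothetical helper, heuristic only).

    The script is approximated by the first word of each character's
    Unicode name, e.g. "DEVANAGARI LETTER KA" -> "DEVANAGARI".
    Digits, punctuation, spaces, and combining vowel signs are skipped
    because str.isalpha() is False for them.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts


# The article's Hinglish example, written in Latin script.
romanised = "beta, dinner ke liye kya chahiye?"
# An assumed example mixing Devanagari and Latin in one sentence.
mixed = "मुझे dinner चाहिए"

print(script_counts(romanised))
print(script_counts(mixed))
```

The romanised message contains only Latin letters, so a script check alone cannot tell that it is mostly Hindi. That is part of why code-switched text is hard: the language boundary is not visible in the characters themselves.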
The Data Challenge Nobody Talks About
Building an AI model requires massive amounts of text data. English has trillions of words available online. Hindi has a fraction of that. Konkani, Maithili, Dogri, and Bodo have almost nothing.
India’s 22 scheduled languages vary wildly in digital presence:
| Tier | Languages | Digital Data Availability |
|---|---|---|
| High | Hindi, Bengali, Tamil, Telugu, Marathi | Reasonable (millions of web pages) |
| Medium | Gujarati, Kannada, Malayalam, Punjabi, Odia | Limited (hundreds of thousands) |
| Low | Assamese, Maithili, Santali, Dogri, Bodo, Manipuri, Kashmiri | Extremely sparse |
This is where AI4Bharat’s work at IIT Madras becomes critical. They’ve built the largest open-source datasets for Indian languages: IndicCorp (billions of tokens across 22 languages) and Sangraha (a collection of web-crawled Indian-language text). Without this foundational data work, none of the flashier model announcements would be possible.
What Indian AI Could Actually Change
If India succeeds in building AI that truly works in Indian languages, the implications go far beyond chatbots:
- Healthcare: AI-powered symptom checkers in local languages could serve the 600 million Indians who live more than 5 km from the nearest doctor. Voice-based AI (no literacy required) could bridge the gap for the 250 million Indians who can’t read.
- Agriculture: Real-time crop advisory, weather alerts, and market prices in the farmer’s own language and dialect. India’s agriculture sector employs 42% of the workforce but has almost zero AI penetration.
- Education: Personalised tutoring in the language the child actually thinks in, not the language the system teaches in. This could address India’s learning crisis at scale.
- Government services: Navigating government schemes (there are 700+ central schemes alone) requires understanding complex bureaucratic language. AI translators could make welfare accessible to those who need it most.
- Legal access: 95% of court proceedings happen in English or Hindi. AI translation could make justice accessible in all 22 scheduled languages.
The Risks Nobody’s Discussing
India’s AI rush has a shadow side that deserves attention:
- Surveillance potential: AI that understands Indian languages at scale also means AI that can monitor, censor, and profile in Indian languages at scale. India’s data protection framework is still evolving, and the gap between AI capability and privacy safeguards is widening.
- Deepfakes in Indian languages: If AI can generate convincing Hindi or Tamil, it can generate convincing misinformation in those languages. India already has a WhatsApp misinformation problem. AI could supercharge it.
- Job displacement: India’s BPO and IT services sector employs 5+ million people, many in language-related roles (translation, content moderation, customer service). Multilingual AI directly threatens these jobs.
- Corporate capture: If the best Indian-language AI is proprietary (Krutrim, BharatGPT), then access to AI in your own language becomes gated by corporate pricing. Open-source alternatives (AI4Bharat, Sarvam) are the counterbalance.
The Bigger Question
India’s AI language project is ultimately about a deeper question: who gets to participate in the future?
If AI remains primarily an English-language tool, then the digital divide that already separates urban India from rural India, educated India from uneducated India, rich India from poor India, will harden into something permanent. It will become an intelligence divide.
But if India can build AI that works as naturally in Marathi as it does in English, in Assamese as fluently as in Hindi, then that AI has the potential to be the most democratising technology since UPI. Just as UPI made financial services accessible to 500 million previously unbanked Indians, multilingual AI could make knowledge services accessible to a billion linguistically excluded Indians.
The race is on. The question is whether India builds AI for all its languages, or just the profitable ones.