Building ElectionGPT for the 2026 Tamil Nadu elections
Lugman Hussain Khan

ElectionGPT is a free chat application developed by Madhi AI for the 2026 Tamil Nadu legislative elections. The goal was simple. Let anyone ask questions about candidate details, party ideologies, and election promises in plain language without needing to search through PDFs, scanned affidavits, or long news articles.
We wanted the experience to feel conversational and fast. No signups. No complicated filters. Just ask a question and get an answer grounded in actual election data.
This post walks through how we built the system, the tradeoffs we made, and the retrieval pipeline that powers every response.
The hard part was the data
Before building retrieval pipelines or answer generation systems, we had to deal with the data itself.
Almost every source came in a different format. Some were image-heavy documents. Some were written entirely in Tamil. Even party manifestos varied wildly in structure and detail. A large part of the project ended up being data cleaning and extraction.
Processing party manifestos
The major parties released their manifestos in different formats and at different times. AIADMK published a roughly 44-page manifesto in English. DMK, TVK, and NTK released theirs in Tamil. NTK's manifesto alone crossed 490 pages and wrapped long contextual explanations around each promise instead of short bullet points.
Our goal was not just OCR. We wanted clean, structured, English summaries of every promise that could later be searched reliably.
We used Gemini models for this extraction pipeline. The models received manifesto section images and returned structured JSON containing concise promise statements stripped of surrounding noise.
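As a rough illustration of that stage, here is a minimal sketch using the google-genai SDK; the model name, prompt, and schema are illustrative rather than our exact production setup.

from google import genai
from google.genai import types
from pydantic import BaseModel

class Promise(BaseModel):
    # One promise, summarised as a concise English statement
    promise_text: str
    section: str

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def extract_promises(page_image: bytes, party: str) -> list[Promise]:
    # One manifesto page image in, a list of structured promises out
    response = client.models.generate_content(
        model="gemini-2.5-flash",  # illustrative model id
        contents=[
            types.Part.from_bytes(data=page_image, mime_type="image/png"),
            f"Extract every election promise on this {party} manifesto page "
            "as a concise English statement, stripped of surrounding noise.",
        ],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=list[Promise],
        ),
    )
    return response.parsed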
Every extracted promise was manually verified against the original document. This step mattered more than expected. Translation and summarisation models can subtly drift in meaning. A small wording change in a manifesto promise can completely alter interpretation. Manual verification helped us keep the final indexed data trustworthy.
Extracting candidate affidavits
Candidate affidavits were even harder. These documents contain information like:
- Educational qualifications
- Occupation
- Net worth
- Criminal cases
- Liabilities
Most of them were scanned Tamil documents with varying scan quality. Direct structured extraction with Gemini alone was unreliable, especially for large tables and numeric fields. So we split the process into two stages.
First, we passed the affidavit scans through Sarvam AI's OCR pipeline to generate markdown text. Then that OCR output was sent to Gemini 3 Flash for structured extraction into JSON.
A typical structured output looked like this:
{
  "candidate_name": "S. Nandakumar",
  "party_name": "All India Puratchi Thalaivar Makkal Munnetra Kazhagam",
  "highest_qualification_text": "MBA",
  "occupation_text": "Self Employed",
  "pending_criminal_case_count": 1,
  "conviction_case_count": 0,
  "candidate_movable_assets_total": 36851467,
  "candidate_immovable_assets_total": 116541640,
  "candidate_total_liabilities": 35438356,
  "candidate_government_dues_total": 0,
  "latest_declared_income": 2327650
}
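The second stage is then a plain text-in, JSON-out call. A minimal sketch, assuming the google-genai SDK and a pydantic model mirroring the fields above (trimmed here for brevity); the Sarvam OCR stage supplies ocr_markdown:

from google import genai
from google.genai import types
from pydantic import BaseModel

class Affidavit(BaseModel):
    # Mirrors the structured output shown above, trimmed to a few fields
    candidate_name: str
    party_name: str
    pending_criminal_case_count: int
    candidate_movable_assets_total: int
    candidate_total_liabilities: int

client = genai.Client()

def extract_affidavit(ocr_markdown: str) -> Affidavit:
    # ocr_markdown is the markdown text produced by the Sarvam OCR stage
    response = client.models.generate_content(
        model="gemini-3-flash",  # model id assumed from the name above
        contents=["Extract the candidate's details from this affidavit as JSON:\n\n"
                  + ocr_markdown],
        config=types.GenerateContentConfig(
            response_mime_type="application/json",
            response_schema=Affidavit,
        ),
    )
    return response.parsed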
The final dataset let us answer analytical questions such as:
- Which parties have the highest number of candidates with criminal cases?
- Which constituency has the wealthiest candidates?
Collecting party ideologies
Party ideologies were sourced from official party websites, translated into English where required, and manually verified before indexing.
Compared to affidavits and manifestos, this part was relatively straightforward. The main challenge was keeping tone and meaning consistent across translations.
Designing the retrieval pipeline
Once the data was cleaned and structured, the next step was building retrieval systems that could answer wildly different types of queries.
Some users asked direct factual questions.
Who is contesting from Coimbatore South?
Others asked comparative or analytical questions.
Which party promises the most welfare schemes for women?
The retrieval pipeline had to support both.
Query routing
Every incoming query first passes through a lightweight LLM classifier that labels the request as one of three categories:
- manifesto
- candidate_lookup
- no_retrieval
The output is a simple structured JSON.
{
  "queryType": "candidate_lookup"
}
no_retrieval covers queries that don't need a database lookup at all, such as questions about the system prompt, parties that aren't indexed, or anything else off-topic.
The router also receives prior chat context. We pass pairs of user messages and the last few lines of each generated response. This lets the router handle short follow-ups like "Ok", "Yes", or formatting tweaks like "Give me in table format".
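A minimal sketch of the router call, assuming the GPT-OSS models sit behind an OpenAI-compatible endpoint; the URL, model id, and prompt are placeholders:

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")  # placeholder endpoint

ROUTER_PROMPT = (
    "Classify the user's request as one of: manifesto, candidate_lookup, "
    'no_retrieval. Reply with JSON like {"queryType": "..."}.'
)

def route(query: str, history: list[dict]) -> str:
    # history carries prior user turns plus the tail of each earlier answer,
    # so short follow-ups like "Ok" or "Give me in table format" route correctly
    response = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder model id
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  *history,
                  {"role": "user", "content": query}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["queryType"]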
Manifesto retrieval
Manifesto retrieval uses a hybrid search pipeline built on top of Milvus. Each indexed document represents a single promise from a single party. Every record contains:
- A dense vector generated using Voyage-4-Large
- A sparse BM25 representation
- Metadata about the originating party and manifesto section
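In pymilvus, a record schema along those lines might look like this, assuming Milvus 2.5-style full-text search where the BM25 sparse vector is derived server-side from the promise text; field names, the URI, and the dimension are illustrative:

from pymilvus import DataType, Function, FunctionType, MilvusClient

milvus = MilvusClient(uri="http://localhost:19530")  # placeholder URI

schema = milvus.create_schema(auto_id=True)
schema.add_field("id", DataType.INT64, is_primary=True)
schema.add_field("promise_text", DataType.VARCHAR, max_length=4096, enable_analyzer=True)
schema.add_field("dense_vector", DataType.FLOAT_VECTOR, dim=1024)  # dense embedding
schema.add_field("sparse_vector", DataType.SPARSE_FLOAT_VECTOR)    # BM25, filled by Milvus
schema.add_field("party", DataType.VARCHAR, max_length=64)
schema.add_field("section", DataType.VARCHAR, max_length=128)

# Milvus computes the BM25 sparse representation from the raw text at insert time
schema.add_function(Function(
    name="promise_bm25",
    function_type=FunctionType.BM25,
    input_field_names=["promise_text"],
    output_field_names=["sparse_vector"],
))

milvus.create_collection(collection_name="manifesto_promises", schema=schema)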
Before retrieval, every user query goes through a query expansion stage. The expansion model rewrites the query into a clearer semantic form and also generates keyword variations for BM25 search.
Example:
{
  "rephrasedQuery": "What specific promises are made about providing 200 free units of electric power to consumers?",
  "keywords": [
    "200 free units",
    "electric power",
    "free electricity",
    "electricity subsidy",
    "electricity promise",
    "free power units",
    "electricity allocation"
  ]
}
Why expand the query?
This step solved two important problems.
- Recall. Carefully generated keywords consistently improved BM25 retrieval quality.
- Conversational context.
If a user asks:
What are DMK's promises for women?
and then follows up with:
What about NTK?
the second query depends entirely on the first and only becomes meaningful once that prior context is folded back into the rewritten query.
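Mechanically, the expansion call looks much like the router: the same structured-JSON pattern, with the chat history included so follow-ups become standalone queries. A rough sketch, reusing the placeholder client from the router sketch:

def expand(query: str, history: list[dict]) -> dict:
    # Returns {"rephrasedQuery": "...", "keywords": [...]} as in the example above
    response = client.chat.completions.create(
        model="gpt-oss-20b",  # placeholder model id
        messages=[
            {"role": "system", "content":
                "Rewrite the user's latest question as a standalone query and "
                "generate keyword variations for BM25 search. Reply as JSON "
                "with rephrasedQuery and keywords."},
            *history,
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)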
The rephrased query is embedded using Voyage-4-Large.
Retrieval then runs in parallel across party-specific partitions using both dense and sparse search. Each side produces its own ranked list, and Reciprocal Rank Fusion merges the two before the combined results are passed to the answer generation model.
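In pymilvus terms, that step might look like the following, assuming the collection and client sketched earlier; the voyageai call, model id, and search parameters are all illustrative:

import voyageai
from pymilvus import AnnSearchRequest, RRFRanker

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def search_promises(expanded: dict, party: str, limit: int = 10):
    # Dense side: embed the rephrased query
    dense = vo.embed([expanded["rephrasedQuery"]],
                     model="voyage-4-large",  # model id assumed from the name above
                     input_type="query").embeddings[0]
    dense_req = AnnSearchRequest(data=[dense], anns_field="dense_vector",
                                 param={"metric_type": "IP"}, limit=20)
    # Sparse side: raw keyword text, scored with BM25 inside Milvus
    sparse_req = AnnSearchRequest(data=[" ".join(expanded["keywords"])],
                                  anns_field="sparse_vector",
                                  param={}, limit=20)
    # Both requests run in parallel; RRF fuses the two ranked lists
    return milvus.hybrid_search(
        collection_name="manifesto_promises",
        reqs=[dense_req, sparse_req],
        ranker=RRFRanker(k=60),
        partition_names=[party],  # party-specific partition
        output_fields=["promise_text", "party", "section"],
        limit=limit,
    )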
The answer model filters the retrieved excerpts against the original user question and generates the final markdown response.
The routing, query expansion, and answer generation stack runs on a mix of GPT-OSS-20B and GPT-OSS-120B models.
Candidate information retrieval
Candidate queries behave very differently from manifesto queries.
Some are simple lookups.
Show candidates from Madurai Central
Others are analytical.
Which parties have the highest number of candidates with pending criminal cases?
To support this range of queries, the structured affidavit data is stored in SQLite instead of a vector database.
The pipeline itself is straightforward. An LLM generates SQL from the user query, the SQL runs against SQLite, and the query results are passed to the answer generation model along with the original user prompt.
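Condensed, the loop might look like this, reusing the placeholder OpenAI-compatible client from the router sketch; the schema prompt and database path are illustrative:

import sqlite3

SQL_PROMPT = (
    "Write a single SQLite SELECT statement answering the user's question "
    "against the table candidates(candidate_name, party, constituency, "
    "district, age, education, pending_criminal_case_count, ...). "
    "Reply with SQL only."
)

def answer_candidate_query(query: str) -> str:
    # Stage 1: the LLM writes the SQL
    sql = client.chat.completions.create(
        model="gpt-oss-120b",  # placeholder model id
        messages=[{"role": "system", "content": SQL_PROMPT},
                  {"role": "user", "content": query}],
    ).choices[0].message.content.strip()
    # Stage 2: run it against the affidavit database (read-only in practice)
    with sqlite3.connect("candidates.db") as db:
        rows = db.execute(sql).fetchall()
    # Stage 3: the LLM turns the rows plus the original question into prose
    return client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content":
                   f"Question: {query}\nSQL results: {rows}\n"
                   "Answer the question from these results."}],
    ).choices[0].message.content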
One practical challenge was handling spelling variations across districts and constituencies.
Users refer to the same location in many different ways. For example, someone might ask about "Kovai" while the database stores the district as "Coimbatore".
Instead of injecting alias mappings directly into the prompt, we maintain a separate location_aliases table inside SQLite itself. During query generation, the model learns to resolve user-provided aliases through subqueries against this table.
For example:
Query: Candidates contesting in Kovai?
SELECT
  candidate_name,
  party,
  constituency,
  gender,
  age,
  education,
  occupation,
  pending_criminal_case_count,
  conviction_case_count,
  candidate_movable_assets_total,
  candidate_immovable_assets_total,
  candidate_total_liabilities,
  latest_declared_income,
  photo_url
FROM candidates
WHERE district = (
  SELECT canonical
  FROM location_aliases
  WHERE alias = 'kovai'
    AND type = 'district'
  LIMIT 1
)
ORDER BY constituency, candidate_name
LIMIT 100
Here, the mapping between Kovai and Coimbatore is resolved through the alias table before filtering the candidate records.
This approach ended up being far more reliable than hardcoding aliases into prompts. It also made the system easier to maintain as new spelling variations appeared during real-world usage.
Both SQL generation and answer generation run on GPT-OSS-120B.
Queries classified as no_retrieval skip retrieval entirely and go straight to the answer model with a prompt tuned for that case.
Why we avoided a fully agentic system
We considered making the whole thing agentic, exposing retrieval as tools and letting a model orchestrate the calls. We chose not to.
We were running on small, cheap open-source models, and we wanted snappy responses. An agent loop with these models tends to burn extra iterations deciding what to do, and it occasionally falls into doom loops that hurt both latency and accuracy. Splitting the flow into narrow, well-defined LLM tasks orchestrated by deterministic code keeps each call lean and the pipeline predictable.
It also makes the system easier to improve. Each step can be measured and tuned in isolation.
Cost and latency
ElectionGPT is free and has no signup. It needed to feel instant for users coming from any background. To hit that, we stuck with small open-source models, mainly GPT-OSS-120B and GPT-OSS-20B, which are noticeably faster and cheaper than comparable closed models.
The final latency numbers landed at:
- 6.54 seconds p50
- 15.39 seconds p99
That is the pipeline as it stands today. The election is the deadline, and the data work continues to be the bulk of the effort.
Final thoughts
Most of what makes ElectionGPT feel responsive is the work that happens before any user types a question. Manifesto promises extracted and verified line by line. Affidavit data normalised into a single schema. Place name aliases mapped out in advance. By the time a query arrives, the hard problems are already solved offline, and the runtime pipeline only has to do narrow, well-defined work.
That is what makes the small open-source models viable. GPT-OSS-20B and GPT-OSS-120B do not have to reason through a 490-page Tamil PDF or guess at constituency spellings. They classify, expand, write SQL, or summarise a handful of excerpts. Each call has a tight scope, which is exactly the kind of work these models handle well.
With a well-prepared dataset and a pipeline built around small focused tasks, cheap open-source models hold up in production. The architecture carries the quality, and the model is just one component inside it.