Rewrite user queries before your search engine sees them

A user types "latest advancements LLMs healthcare?", your search returns junk, and they bounce before scrolling. The query is short, ungrammatical, and missing the keywords your index is built on, but it is exactly how humans search. The fix is a small LLM call that sits in front of retrieval and returns a structured object: a cleaned-up rewritten query, an extracted keyword list, a set of hypothetical questions the user might actually be asking, and a one-line summary. By the end of this post you will have the Python, the Pydantic schema, the example output, and a short list of the failure modes you will hit the week after you ship it.

The shape of the problem

Users do not type queries. They type fragments. "latest advancements LLMs healthcare?" is four words and a question mark, with no verb and no connecting grammar. Your BM25 or vector index was tuned on cleaner text. The retrieval layer is doing its job; the input is the problem.

You have two bad options and one good one. You can train your users to write better queries (they will not). You can rebuild your retrieval stack around a model that handles fragments natively (expensive, slow, and you still want keywords for filters and analytics). Or you can put a small model in front of retrieval whose only job is to translate human input into something your existing stack already knows how to handle.

The third option is a few lines of Python.

The rewriter

You want one model call that returns four things: a rewritten query optimised for search, a list of extracted keywords, a list of hypothetical questions the original query might be asking, and a short summary. Structured output, not free text — the downstream code has to consume this without parsing.

Here is the implementation using instructor and Gemini Flash:

import instructor
import google.generativeai as genai
from pydantic import BaseModel, Field
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure the Gemini API key
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Initialize the Gemini client with Instructor
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    ),
    mode=instructor.Mode.GEMINI_JSON,
)

class RewrittenQuery(BaseModel):
    rewritten_query: str = Field(description="The rewritten query for improved search.")
    keywords: list[str] = Field(description="Keywords extracted from the original query.")
    hypothetical_questions: list[str] = Field(
        default_factory=list,
        description="Hypothetical questions that the original query might be asking.",
    )
    summary: str = Field(description="A brief summary of the original query.")

def rewrite_query(query: str) -> RewrittenQuery:
    """Rewrites a user query using Gemini and returns a structured output."""
    return client.chat.completions.create(
        response_model=RewrittenQuery,
        messages=[
            {
                "role": "user",
                "content": f"""
                    Please analyze the following query and provide:
                    - A rewritten query for improved search relevance.
                    - A list of keywords extracted from the query.
                    - A list of hypothetical questions that the query might be asking.
                    - A brief summary of the query.

                    Query: {query}
                """,
            }
        ],
    )

if __name__ == "__main__":
    sample_query = "latest advancements LLMs healthcare?"
    rewritten_query = rewrite_query(sample_query)
    print(rewritten_query.model_dump_json(indent=2))

Run it against "latest advancements LLMs healthcare?" and you get something like this (the exact wording varies from run to run):

{
  "rewritten_query": "Recent advancements in large language models for healthcare applications",
  "keywords": [
    "LLMs",
    "large language models",
    "healthcare",
    "advancements",
    "applications",
    "medical"
  ],
  "hypothetical_questions": [
    "What are the most recent breakthroughs in using LLMs for healthcare?",
    "How are LLMs being applied to improve healthcare?",
    "What are the potential benefits and risks of using LLMs in healthcare?",
    "What are the latest research papers on LLMs in healthcare?"
  ],
  "summary": "The query seeks information on the newest developments in the use of large language models within the healthcare industry."
}

The rewritten query goes into your retriever. The keywords feed filters and analytics. The hypothetical questions become candidate queries for a multi-query retrieval pass, or sit next to the result list as "did you mean to ask?" suggestions. The summary is a cheap audit trail when you debug a bad retrieval later.
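The multi-query pass mentioned above can be sketched as follows. This is one possible approach, not the only one: it merges the ranked lists with reciprocal rank fusion, a technique swapped in here for illustration. The names fuse_queries, fake_search, and FAKE_INDEX are hypothetical, and the retriever is assumed to return a ranked list of document ids.

```python
from collections import defaultdict
from typing import Callable, Sequence


def fuse_queries(
    queries: Sequence[str],
    search: Callable[[str], Sequence[str]],
    k: int = 10,
    rrf_k: int = 60,
) -> list[str]:
    """Run every query through `search` and merge the ranked doc ids with RRF."""
    scores: dict[str, float] = defaultdict(float)
    for query in queries:
        for rank, doc_id in enumerate(search(query), start=1):
            # A document earns more credit the higher it ranks, once per query.
            scores[doc_id] += 1.0 / (rrf_k + rank)
    # Sort by fused score, breaking ties by doc id so the order is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical retriever: a dict lookup standing in for BM25 or a vector index.
FAKE_INDEX = {
    "Recent advancements in large language models for healthcare applications": ["d1", "d2", "d3"],
    "How are LLMs being applied to improve healthcare?": ["d2", "d4"],
}


def fake_search(query: str) -> list[str]:
    return FAKE_INDEX.get(query, [])


top = fuse_queries(list(FAKE_INDEX), fake_search, k=3)
print(top)  # ['d2', 'd1', 'd4']
```

In the real pipeline the query list would be the rewritten query followed by the hypothetical questions, and search would be your existing retriever. A document that shows up for more than one phrasing, like d2 here, rises to the top.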

That is the whole rewriter. The interesting part is what happens after you put it in production.

What you will hit next

Three predictions for the team that ships this on Monday.

1. Your rewrite latency will become the new bottleneck. A Gemini Flash call is fast, but it is not free, and you are now adding it to every search request. Cache rewritten queries by hash: the same input string should never hit the model twice. Most search traffic is repeat queries with a long tail of novel ones; a small LRU cache will absorb the head of the distribution and your p50 will look almost untouched.
2. The hypothetical-questions output will turn out to be more useful than the rewrite for analytics. The rewrite tells your retriever what to do. The hypothetical questions tell you what your users were actually trying to ask. Pipe them into your analytics warehouse and group by week. You will discover product questions, support questions, and content gaps that nobody surfaced in any user interview. This is the part of the system that earns its keep after the first month.
3. Your rewrite model will hallucinate plausible-sounding keywords that match no documents in your index. In the example output above, "medical" was not in the original query. If your corpus does not use that word, the rewrite has made retrieval worse by introducing a confident-looking term that drags relevance away from the documents that do exist. Log retrieval recall before and after the rewriter is enabled on a holdout set. Do not treat the rewrite as a win until you have seen that number move in the right direction on real traffic.

The shared lesson behind all three: a rewriter is not a drop-in upgrade. It is a new component with its own latency, its own analytics surface, and its own failure mode. Treat it that way and it will pay back. Bolt it on and forget about it and you will spend a quarter debugging a retrieval regression you introduced yourself.
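The caching advice from the first prediction might look like the sketch below. It is an in-process LRU cache keyed by a SHA-256 hash of the query. RewriteCache is a hypothetical name, the whitespace collapsing before hashing is an assumption rather than something the rewriter requires, and a shared store such as Redis would be the likelier choice once you run more than one process.

```python
import hashlib
from collections import OrderedDict
from typing import Any, Callable


class RewriteCache:
    """Small LRU cache in front of a rewrite function (a sketch, not production code)."""

    def __init__(self, rewrite: Callable[[str], Any], maxsize: int = 10_000):
        self._rewrite = rewrite
        self._maxsize = maxsize
        self._entries = OrderedDict()  # hash key -> cached rewrite result
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(query: str) -> str:
        # Assumption: collapsing whitespace is safe. Case is preserved because
        # "LLMs" and "llms" may rewrite differently.
        canonical = " ".join(query.split())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def __call__(self, query: str) -> Any:
        k = self.key(query)
        if k in self._entries:
            self._entries.move_to_end(k)  # mark as most recently used
            self.hits += 1
            return self._entries[k]
        self.misses += 1
        result = self._rewrite(query)  # the only place the model is called
        self._entries[k] = result
        if len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # evict the least recently used
        return result


# Stand-in for rewrite_query so the sketch runs without an API key.
def fake_rewrite(query: str) -> dict:
    return {"rewritten_query": query.strip().capitalize()}


cached_rewrite = RewriteCache(fake_rewrite, maxsize=1000)
cached_rewrite("latest advancements LLMs healthcare?")
cached_rewrite("latest  advancements LLMs healthcare? ")
print(cached_rewrite.hits, cached_rewrite.misses)  # 1 1
```

The hit and miss counters are there so you can check the claim about the head of the distribution against your own traffic before trusting it.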

The minimum to ship

Three things, in order:

1. Wrap the function above behind a cache keyed by the raw query string.
2. Send the rewritten query to your retriever and keep the original query in your logs.
3. Compare recall@10 on a 50-query holdout set, with and without the rewriter, before you flip it on for everyone.
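The recall comparison in the last step can be as small as the sketch below. It assumes each holdout query has been labelled by hand with the set of document ids that count as relevant, and that your retriever returns a ranked list of ids. The function names and the toy data are hypothetical.

```python
from typing import Callable, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the relevant doc ids that appear in the top k results."""
    if not relevant:
        raise ValueError("each holdout query needs at least one relevant doc id")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(
    holdout: Sequence[tuple[str, set[str]]],
    search: Callable[[str], Sequence[str]],
    transform: Callable[[str], str] = lambda q: q,
    k: int = 10,
) -> float:
    """Average recall@k over the holdout set, applying `transform` to each query first."""
    scores = [recall_at_k(search(transform(q)), rel, k) for q, rel in holdout]
    return sum(scores) / len(scores)


# Toy data so the sketch runs; swap in your labelled queries and real retriever.
holdout = [("llms healthcare?", {"d1", "d2"}), ("bm25 tuning", {"d3"})]
index = {
    "llms healthcare?": ["d9"],
    "large language models in healthcare": ["d1", "d2"],
    "bm25 tuning": ["d3"],
}
rewrites = {"llms healthcare?": "large language models in healthcare"}


def search(query: str) -> list[str]:
    return index.get(query, [])


baseline = mean_recall(holdout, search)
with_rewriter = mean_recall(holdout, search, transform=lambda q: rewrites.get(q, q))
print(f"recall@10 baseline={baseline:.2f} rewritten={with_rewriter:.2f}")
```

In practice transform would call the cached rewriter and return its rewritten_query field. Run both numbers against the same index snapshot so that the rewriter is the only thing that changed.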

If recall moves up, ship it. If it moves down, the likeliest cause is that the rewrite is hallucinating keywords your corpus does not contain, and the fix is in the prompt: constrain the model to keywords drawn from a controlled vocabulary or from the original query itself.
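One way to apply that constraint is sketched below. constraint_clause builds a sentence you could append to the rewriter prompt. filter_keywords is an extra post-hoc guard that the post does not prescribe: it drops any keyword that comes from neither the original query nor the vocabulary. Both function names and the VOCABULARY set are hypothetical, and the word-level matching is deliberately crude.

```python
import re
from typing import Iterable


def constraint_clause(vocabulary: Iterable[str]) -> str:
    """Prompt sentence restricting keywords to the query or a controlled vocabulary."""
    terms = ", ".join(sorted(vocabulary))
    return (
        "Only return keywords that appear in the original query "
        f"or in this list: {terms}."
    )


def filter_keywords(
    keywords: Iterable[str], original_query: str, vocabulary: Iterable[str]
) -> list[str]:
    """Keep keywords made of words from the original query or found in the vocabulary."""
    query_tokens = set(re.findall(r"\w+", original_query.lower()))
    vocab = {term.lower() for term in vocabulary}
    kept = []
    for keyword in keywords:
        tokens = re.findall(r"\w+", keyword.lower())
        from_query = bool(tokens) and all(t in query_tokens for t in tokens)
        if from_query or keyword.lower() in vocab:
            kept.append(keyword)
    return kept


# Assumed vocabulary; in practice, derive it from terms your index actually contains.
VOCABULARY = {"large language models", "healthcare"}
model_keywords = [
    "LLMs", "large language models", "healthcare",
    "advancements", "applications", "medical",
]

kept = filter_keywords(model_keywords, "latest advancements LLMs healthcare?", VOCABULARY)
print(constraint_clause(VOCABULARY))
print(kept)  # ['LLMs', 'large language models', 'healthcare', 'advancements']
```

With this toy vocabulary, "applications" and "medical" are dropped because they appear in neither the query nor the list. Whether that helps or hurts depends on your corpus, so measure it with the same recall check.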

If your search is returning junk for messy queries right now, send me your 10 worst-performing search queries from last week and I will write the rewriter prompt that fixes the top three. [email protected].