Adding Conversational Search — and the Problem That Came With It

The /rag endpoint answered individual questions well. Turning it into a real chat interface meant solving a retrieval problem that only shows up once you start asking follow-ups.

The /rag endpoint worked well for single questions. You asked something, it retrieved relevant papers, synthesized an answer, cited its sources. But it had no memory. Every question was independent — there was no thread, no context carried between turns. You’d ask about batch normalization, get a good answer, then ask “how does that compare to layer norm?” and it would treat the second question as if the first had never happened.

The natural next step was to add a proper /chat endpoint: full multi-turn conversation, context preserved across turns, follow-up questions understood in context. Building it took about a day. The retrieval problem that emerged took longer to figure out.

The problem with follow-up questions

A follow-up question like “how does that compare to layer norm?” contains almost no semantic information in isolation. There’s no mention of what “that” refers to, no mention of what domain we’re in. If you embed that text and run k-NN against 624K paper abstracts, the resulting vector lands in a low-information region of the semantic space. You’ll get papers about comparison methods, or papers that happen to use the phrase “layer norm” in passing. Not what you wanted.

This matters because vector search is stateless — it only sees the query text you give it. It has no concept of the conversation that led to this question. The user’s intent is clear from context; the retrieval system doesn’t have that context.
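To make the statelessness concrete, here is a minimal sketch of a brute-force cosine k-NN lookup. The names `cosine` and `knn` and the dict-based index are illustrative assumptions, not the index behind the endpoint; at 624K abstracts a real deployment would more likely use an approximate nearest-neighbor index. The point is the signature: the only input is a query vector, so nothing about the conversation ever reaches the search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn(query_vec, index, k=5):
    """Return the k best (score, paper_id) pairs, highest score first.

    `index` is a hypothetical in-memory mapping of paper_id -> embedding.
    The function sees only the query vector; it has no conversation state.
    """
    scored = [(cosine(query_vec, vec), paper_id) for paper_id, vec in index.items()]
    scored.sort(reverse=True)
    return scored[:k]
```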

The naive fix — concatenate the full conversation history and embed all of it — has its own problems. Long concatenated context dilutes the signal of the actual question and pushes the embedding toward the center of the conversation rather than the specific thing you’re asking about.

Query rewriting

The approach that works: before running retrieval, use the LLM to rewrite the user’s latest question into a standalone search query that captures the full conversational intent.

“How does that compare to layer norm?” becomes:

batch normalization vs layer normalization comparison in deep learning

That embeds cleanly near the right papers. The retrieval mechanism is unchanged — same index, same k-NN. The rewrite just gives it a query that actually means something on its own.

The rewriting prompt is simple: given the truncated conversation history and the user’s latest message, produce a short standalone search query that weaves in the key topic thread. The output is constrained to 64 tokens, which keeps the call cheap and fast. The rewrite is for retrieval only — the original question is still passed to the synthesis call so the response sounds natural.
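A rough sketch of what that rewriting step could look like follows. The prompt wording, the `max_history_turns` cutoff, and the `complete` callable (a stand-in for whatever function wraps the model call) are all assumptions made for illustration; only the 64-token cap and the "standalone search query" goal come from the description above.

```python
REWRITE_SYSTEM = (
    "Rewrite the user's latest message as a short standalone search query. "
    "Resolve pronouns and references using the conversation. "
    "Return only the query."
)


def rewrite_query(history, latest, complete, max_history_turns=6):
    """Turn a context-dependent follow-up into a standalone search query.

    history:  list of {"role": ..., "content": ...} dicts from earlier turns
    latest:   the user's newest message
    complete: hypothetical callable (system, prompt, max_tokens) -> str
    """
    # A first question is already standalone, so no model call is needed.
    if not history:
        return latest

    # Truncate the history so the rewrite call stays small and fast.
    recent = history[-max_history_turns:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in recent)
    prompt = (
        f"Conversation so far:\n{transcript}\n\n"
        f"Latest message: {latest}\n\n"
        "Standalone search query:"
    )

    rewritten = complete(system=REWRITE_SYSTEM, prompt=prompt, max_tokens=64).strip()
    # Fall back to the raw question if the model returns nothing usable.
    return rewritten or latest
```

The return value is used only for retrieval; the caller still passes the original `latest` message to synthesis.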

The two-stage pipeline

Every /chat turn after the first makes two sequential Claude Haiku calls; the first turn makes only the synthesis call:

Call 1 — query rewriting (~0.5s, max_tokens=64)

Fires only on turn 2 and beyond. The first question is standalone by definition, so it skips rewriting and goes directly to retrieval.

Call 2 — synthesis (~3–6s, max_tokens=1024)

Takes the full conversation history with retrieved paper abstracts injected into the final user turn, and generates a conversational answer with citations.

[
  {"role": "user",      "content": "What is batch normalization?"},
  {"role": "assistant", "content": "...prior answer..."},
  {"role": "user",      "content": "<papers>\n...\n</papers>\n\nHow does that compare to layer norm?"}
]

Papers are injected only into the most recent user turn, not into the full history. This keeps the earlier turns clean and scopes the retrieved context to the specific question it was retrieved for.
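One way to assemble that message list is sketched below. The function name and the `title` / `abstract` fields on each paper are assumptions for illustration; the `<papers>` wrapper and the rule of injecting only into the final user turn follow the example above.

```python
def build_synthesis_messages(history, question, papers):
    """Build the synthesis message list, with papers only in the final user turn.

    history:  earlier turns as {"role": ..., "content": ...} dicts, stored clean
    question: the user's original (un-rewritten) latest message
    papers:   retrieved papers; each is assumed to have "title" and "abstract"
    """
    papers_block = "\n\n".join(
        f"[{i}] {p['title']}\n{p['abstract']}" for i, p in enumerate(papers, start=1)
    )
    final_content = f"<papers>\n{papers_block}\n</papers>\n\n{question}"

    # Copy the history instead of mutating it, so stored turns stay paper-free.
    return [*history, {"role": "user", "content": final_content}]
```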

Splitting into two calls is intentional. The rewriting call is cheap enough that it could be swapped for a smaller or faster model without touching synthesis. Combining them into one call would lose that flexibility and make latency harder to optimize.
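The overall shape of a turn can be sketched as a small orchestration function. Everything here is hypothetical scaffolding: `rewrite`, `retrieve`, and `synthesize` are injected callables standing in for the real model and index calls, which is also what makes it straightforward to swap the rewriting model independently of synthesis.

```python
def chat_turn(history, question, rewrite, retrieve, synthesize):
    """Run one /chat turn and return (answer, updated_history).

    rewrite(history, question) -> str        # stage 1, skipped on the first turn
    retrieve(search_query) -> list           # vector search over the index
    synthesize(history, question, papers) -> str   # stage 2
    All three are hypothetical callables supplied by the caller.
    """
    # The first question is standalone, so it goes straight to retrieval.
    search_query = rewrite(history, question) if history else question

    papers = retrieve(search_query)

    # Synthesis receives the original question, not the rewritten query.
    answer = synthesize(history, question, papers)

    # Store the turn without the injected papers so earlier turns stay clean.
    updated_history = [
        *history,
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    return answer, updated_history
```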

Surfacing uncertainty

There’s a related failure mode worth addressing directly: when retrieval quality is low, the model fills the gap with general training knowledge rather than staying silent or saying it doesn’t know. The response still sounds confident. The user has no way to tell whether the answer came from retrieved papers or from what Claude already knows.

The top cosine similarity score — between the rewritten query vector and the best-matching paper in the index — is a reasonable proxy for retrieval quality. Scores above 0.80 generally mean the index has closely relevant papers. Below 0.70, the results are noticeably less targeted.

When the top score falls below 0.70, the UI surfaces a low-confidence flag alongside the answer. The model still responds; the flag just makes the uncertainty visible rather than leaving it hidden inside a fluent answer. It’s a small addition that meaningfully changes how you calibrate trust in specific responses.
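The check itself can be very small. In this sketch the function name and the returned field names are assumptions; the 0.70 cutoff is the one described above, and treating an empty result set as low confidence is an added assumption.

```python
LOW_CONFIDENCE_THRESHOLD = 0.70  # cutoff taken from the text above


def confidence_flag(scores, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Summarize retrieval quality from the cosine scores of the retrieved papers.

    Returns the top score and whether the UI should show a low-confidence flag.
    An empty result list is treated as low confidence (an assumption).
    """
    top_score = max(scores, default=0.0)
    return {"top_score": top_score, "low_confidence": top_score < threshold}
```

The response payload can carry this dict next to the answer text, leaving it to the UI to decide how to render the flag.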

What using /chat actually feels like

With query rewriting in place, the chat interface behaves the way you’d expect a research assistant to. You can ask a question, get a grounded answer with paper citations, then follow up with “what are the practical tradeoffs?” and “which of those approaches is more common in production?”, and each follow-up retrieves relevant papers for that specific angle of the conversation.

The big difference from /rag isn’t just multi-turn memory — it’s that the retrieval adapts to where the conversation has gone. The papers cited in turn 4 are not necessarily the same ones from turn 1. Each answer is grounded in whatever the index considers most relevant to the current question, given where we’ve been.

After building the chat interface and feeling good about it, I decided to try improving retrieval quality further — specifically by adding BM25 lexical search alongside the vector search. That experiment is covered in the next post.