Building a Semantic Search Engine for arXiv ML Papers

From 624k raw abstracts to a deployed search and Q&A interface — the pipeline, the infrastructure decisions, and how the project evolved from simple retrieval to grounded answers.

arXiv publishes thousands of machine learning papers a month. The site’s built-in search is keyword-based: it finds papers containing your search terms. That works when you know exactly what you’re looking for. It doesn’t work as well when you want to ask a question — “why do transformers outperform RNNs on long sequences?” — and find the papers most relevant to that idea, regardless of whether your exact phrasing appears anywhere in the abstracts.

That was the starting point for aipapers.chat. I wanted to build something that let you search by meaning, not by keyword.

The corpus

arXiv makes its metadata available through a Kaggle snapshot (~4GB) covering the full repository — 1.7 million papers across every category, from mathematics to astrophysics. I scoped this project to modern ML and AI research, filtering to five categories:

cs.AI — Artificial Intelligence
cs.LG — Machine Learning
cs.CL — Computation and Language (NLP)
cs.CV — Computer Vision and Pattern Recognition
stat.ML — Statistics / Machine Learning

These five cover the bulk of deep learning research — language models, computer vision, general ML methods, and the statistical foundations underlying them. I also restricted to papers from 2020 onward, which keeps the index focused on the transformer era and removes older work that would add noise for questions about modern approaches. That filtering brought the bulk load down to around 450,000 abstracts.

The Kaggle snapshot is a one-time bootstrap, not a live feed. Its update schedule is irregular, and re-downloading 4GB to get a few thousand new papers isn’t practical. So I wrote a second ingestion script that pulls directly from the arXiv API, fetching papers published since a given date:

python3 01_ingest/fetch_arxiv.py --since 2025-01-01 --output data/incremental

At this point I also expanded the indexed categories, adding cs.IR (Information Retrieval), cs.NE (Neural and Evolutionary Computing), cs.RO (Robotics), cs.MA (Multiagent Systems), and eess.IV (Image and Video Processing). Between the incremental paper updates and the new categories, the index has grown to 624,000 abstracts. Each step is idempotent — re-running skips anything already processed — and a typical weekly update costs under $0.10 in Bedrock embedding charges.

Abstracts are a good unit to embed. They’re short enough to fit in a single embedding call, dense with meaning, and written specifically to communicate the paper’s contribution. A semantic query should land near the relevant abstract even when the wording is different.

Ingestion and embedding

The pipeline: Kaggle snapshot → filter → shard into 10K-paper JSONL files → upload to S3 → embed with Bedrock → write embedding shards back to S3.

For the embedding model I used Amazon Titan Embed Text v2, which produces 1024-dimensional vectors. I ran 20 concurrent Bedrock workers with each shard independent, so the job is resumable. The full embedding run cost roughly $1–2. That surprised me — 624K abstracts at a few hundred tokens each, at $0.0001 per 1K tokens, works out to almost nothing. It meaningfully changes the calculus on when re-embedding a corpus is worth it.

One thing to commit to early: once you choose an embedding model the index is bound to it. Titan and other models each learn their own vector geometry through training — a query embedded with one model can’t meaningfully search an index built with another. Upgrading later means re-embedding everything from scratch, which for this corpus is a few hours and a couple of dollars, but for a production system with hundreds of millions of documents is a significant operation.

Vector indexing

Embeddings go into Amazon OpenSearch Service with k-NN enabled and HNSW as the indexing algorithm. HNSW builds a multi-layer graph that makes approximate nearest-neighbor search O(log n) rather than brute-force O(n). At 624K vectors, brute force isn’t usable. With HNSW, queries come back in under 250ms.

One infrastructure decision I got wrong first: I started with OpenSearch Serverless. The appeal is obvious for a project like this — no cluster to manage, scales automatically. The problem is that Serverless charges for a minimum of 2 OCUs at $0.24/OCU-hour regardless of how little traffic you’re actually getting. That’s about $34/day for a demo that might get queried a few times a day. Switching to a provisioned r6g.large single-node managed cluster brought it to ~$99/month — roughly 10× cheaper at this traffic level. Serverless makes sense when you have unpredictable, bursty traffic. For a system with known low volume, provisioned capacity is more predictable and significantly cheaper.

The /search endpoint

The first version of the interface was a search box that returned ranked paper cards — title, arXiv ID, categories, publication date, and the truncated abstract — each annotated with a cosine similarity score showing how close the result was to the query vector.

Typing “transformer attention mechanism” returned papers about attention, self-attention, and related architectures, ranked by semantic similarity. The similarity scores (typically 0.72–0.80 for good results) gave you a sense of how relevant each paper actually was — something keyword search doesn’t provide.

It worked well for exploration. But using it made the limitation obvious pretty quickly: you could find the relevant papers, but you still had to read through the abstracts yourself to synthesize an answer to your actual question. The search was finding the right haystack. You were still looking for the needle.

Adding /rag — from retrieval to answers

The next step was to add a /rag endpoint that used the retrieved papers as context to generate a direct answer.

The RAG pattern is: embed the query → retrieve top-k papers from OpenSearch → inject abstracts as structured context into the LLM prompt → generate an answer with numbered citations. The prompt explicitly constrains the model to the retrieved context:

Based ONLY on the following research paper abstracts, answer the question.
If the papers don't contain enough information, say so.
Cite papers by their number [1], [2], etc.

The citation requirement matters. It means you can trace every claim in the answer back to a specific paper. The model can’t drift into generic training knowledge if the prompt grounds it in what was actually retrieved.

For this step I used Claude Haiku as the generation model. Synthesizing a handful of retrieved abstracts into a clear answer is well within Haiku’s capabilities, and keeping the model cost low matters when the per-query infrastructure cost is already dominated by OpenSearch.

The shift from /search to /rag changed the interface significantly. Instead of a ranked list of paper cards, you got a structured answer — with section headers, key findings, and citation numbers — synthesized from the retrieved papers. Same query, much more useful output.

Deployment

The API is a FastAPI app deployed to AWS Lambda as a zip package, fronted by API Gateway (HTTP API) and a CloudFront distribution handling the custom domain and HTTPS. Lambda scales to zero, so there’s no idle compute cost outside of OpenSearch. The frontend is a React/TypeScript app built with Vite and served as static assets from FastAPI.

CloudFront turned out to be necessary for a non-obvious reason: API Gateway HTTP APIs only accept HTTPS — there’s no HTTP listener at all. A browser hitting http://aipapers.chat gets no response, not even a redirect, which shows up as “Could not connect to server.” CloudFront accepts both protocols and redirects HTTP to HTTPS at the edge with a single policy setting, which fixes the problem immediately without waiting for HSTS preload list propagation.

Total infrastructure cost at this scale: ~$99/month for OpenSearch, a few cents for Lambda and API Gateway at low query volumes, Bedrock charges only at inference time. CloudFront falls within the free tier at this traffic level.

What came next

Having a working /search and /rag interface exposed a natural limitation: every question was independent. You could ask one question and get a good answer. But there was no memory between questions — each call started fresh with no awareness of what you’d just asked. Adding a proper conversational layer, where you could ask follow-up questions in context, turned out to introduce a specific technical problem that wasn’t obvious until I built it. That’s covered in the next post.