Engineering · March 26, 2026 · 10 min read

Building a RAG System That Actually Works

Vector search alone falls short. Here's how we combine embeddings, BM25 keyword matching, and reciprocal rank fusion to build product catalogs that understand what you're asking.

Ryan Kearney

Retrieval-Augmented Generation has become the default architecture for building AI systems that need to answer questions about specific data. The concept is straightforward: instead of relying on what a language model memorized during training, you retrieve relevant documents from your own data and feed them to the model as context. The model generates answers grounded in your actual information instead of hallucinating.

The concept is simple. Making it work well in production is not.

We built a RAG-powered product intelligence system for a B2B waterproofing materials supplier: 84 products across English, Chinese, and Spanish, an AI chat interface for technical questions, a guided product finder wizard, and a filterable catalog. Here is what we learned about building RAG systems that actually deliver useful results.

Why pure vector search falls short

The standard RAG tutorial goes like this: chunk your documents, generate embeddings, store them in a vector database, and do similarity search when a user asks a question. This works well enough for demos. It breaks down in production for several reasons.

Vector embeddings capture semantic meaning. They are great at understanding that "waterproof membrane" and "moisture barrier" refer to similar concepts. But they are terrible at exact matches. If a user searches for product code "WP-3200" or a specific ASTM standard number, semantic similarity will not help. The embedding for "WP-3200" has no meaningful semantic relationship to the actual product it represents.

This is where most RAG implementations stop working for real business use cases. Users search for specific things. Part numbers, specification values, brand names, technical standards. Pure vector search treats these as semantic concepts when they are really lookup operations.

Hybrid search: combining the best of both approaches

The solution is hybrid search. We combine two retrieval methods and merge their results.

The first method is vector search using embeddings. We use ChromaDB to store document chunks as vectors. When a query comes in, we generate an embedding and find the most semantically similar chunks. This handles natural language questions like "what product works best for below-grade concrete" or "which membrane can handle ponding water."

The second method is BM25 keyword matching. BM25 is a traditional information retrieval algorithm that scores documents based on term frequency and inverse document frequency. It excels at exact and partial keyword matches. When someone searches for "WP-3200" or "ASTM D6083," BM25 finds the right documents because the terms literally appear in the text.
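To make the scoring concrete, here is a minimal pure-Python sketch of BM25. In production you would use a tuned library implementation; the corpus and whitespace tokenizer here are deliberately simplified stand-ins.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with BM25 (higher = better)."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    df = Counter()  # document frequency per term
    for d in tokenized:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in tokenized:
        freqs = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in freqs:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            tf = freqs[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "WP-3200 self-adhering waterproof membrane for below-grade walls",
    "Acrylic moisture barrier coating for interior concrete",
    "Sealant tape for joints and penetrations",
]
scores = bm25_scores("WP-3200", docs)  # only the first doc scores above zero
```

Note how the exact token "wp-3200" drives the match: the product code never needs a meaningful embedding, it just needs to appear in the text.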

Neither method alone gives great results. Together, they cover each other's weaknesses.

Reciprocal rank fusion: merging two ranked lists

Having two separate ranked lists of results is not useful on its own. You need a way to combine them into a single ranked list. We use reciprocal rank fusion (RRF), a technique that merges ranked lists by assigning scores based on position.

The formula is simple. For each document in each ranked list, the RRF score is 1 / (k + rank), where rank is the document's 1-based position in that list and k is a smoothing constant (we use 60). If a document appears in both lists, its scores are summed. Documents that rank highly in both vector search and BM25 get boosted to the top. Documents that rank highly in only one method still appear, but lower.

This approach is elegant because it does not require normalizing scores between the two methods. Vector similarity scores and BM25 scores operate on completely different scales. RRF sidesteps that problem entirely by only caring about relative position.
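In code, the fusion step is only a few lines. This sketch merges two ranked lists of document IDs (the IDs are illustrative):

```python
def rrf_merge(ranked_lists, k=60):
    """Merge ranked lists of doc IDs via reciprocal rank fusion.
    Score for each appearance is 1 / (k + rank); appearances are summed."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results = ["prod-12", "prod-07", "prod-33"]  # semantic ranking
bm25_results = ["prod-07", "prod-45", "prod-12"]    # keyword ranking
merged = rrf_merge([vector_results, bm25_results])
# → ['prod-07', 'prod-12', 'prod-45', 'prod-33']
```

prod-07 wins because it ranks well in both lists, even though neither method put it first. That is exactly the behavior you want from fusion.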

Chunking strategy matters more than you think

How you split your documents into chunks has an outsized impact on retrieval quality. We tested several approaches before landing on a strategy that works for product data.

The naive approach is fixed-size chunks with overlap. Split every document into 500-token segments with 50 tokens of overlap. This is fast and simple. It also regularly splits product specifications across two chunks, meaning neither chunk has the complete information.

For our product catalog, we use semantic chunking aligned to product boundaries. Each product is its own chunk, containing all specifications, applications, substrates, and pricing information in a single retrievable unit. When a user asks about a specific product or application, they get the complete product record, not a fragment of it.

For longer documents like technical datasheets, we use a hierarchical approach. A summary chunk captures the high-level product description and key specifications. Detail chunks contain the full technical data, installation instructions, and compatibility tables. The summary chunks handle broad queries. The detail chunks handle specific technical questions.
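As a sketch, product-aligned chunking plus the summary/detail split looks something like this. The field names and the product record are illustrative, not our actual schema:

```python
def product_to_chunks(product):
    """Build one summary chunk plus per-section detail chunks from a
    product record. Field names here are illustrative."""
    summary = (
        f"{product['name']} ({product['code']}): {product['description']} "
        f"Key specs: {', '.join(product['key_specs'])}"
    )
    details = [
        {"id": f"{product['code']}-{section}", "text": text}
        for section, text in product["sections"].items()
    ]
    return {"id": product["code"], "text": summary}, details

product = {
    "code": "WP-3200",
    "name": "FlexSeal Membrane",  # hypothetical name for illustration
    "description": "Self-adhering membrane for below-grade concrete.",
    "key_specs": ["60 mil thickness", "ASTM D6083"],
    "sections": {
        "installation": "Apply to a clean, dry substrate above 5 C...",
        "compatibility": "Compatible with poured concrete and CMU...",
    },
}
summary_chunk, detail_chunks = product_to_chunks(product)
```

The summary chunk answers "what is WP-3200," while the detail chunks answer "how do I install it" without dragging the whole datasheet into every retrieval.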

The language problem

Our system supports English, Chinese, and Spanish. This adds a layer of complexity that most RAG tutorials ignore entirely.

Multilingual embeddings exist, but their quality varies by language pair. English-Chinese similarity scores are less reliable than English-Spanish scores. Keyword matching across languages requires either translation at query time or multilingual indexing.

We took a pragmatic approach. Product data is stored in all three languages with explicit language tags. The system detects the query language and prioritizes same-language matches, but cross-language results are still included with lower weighting. For the BM25 index, we maintain separate indexes per language. For the vector store, we use a multilingual embedding model and accept the slight quality reduction on CJK text.
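A rough sketch of the routing logic follows. The detection heuristic here is deliberately crude (CJK codepoints, then a handful of Spanish marker words); a real system would use a proper language-detection library:

```python
SPANISH_MARKERS = {"que", "para", "cual", "impermeable", "membrana", "el", "la"}

def detect_language(query):
    """Crude routing heuristic: CJK codepoints mean Chinese, common
    Spanish words mean Spanish, everything else falls back to English."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in query):
        return "zh"
    if SPANISH_MARKERS & set(query.lower().split()):
        return "es"
    return "en"

def pick_index(query, bm25_indexes):
    """Route to the same-language BM25 index; cross-language indexes
    are still searched downstream, just with lower weight."""
    return bm25_indexes[detect_language(query)]

bm25_indexes = {"en": "en-index", "zh": "zh-index", "es": "es-index"}  # stand-ins
primary = pick_index("membrana impermeable para concreto", bm25_indexes)
```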

The guided product finder wizard avoids the language problem entirely by using structured selection instead of free-text search. The user picks application type, substrate, project scale, and budget through dropdowns and radio buttons. The filtering happens on structured data fields, not text search. This is intentional. For the most common use case of finding the right product, structured navigation outperforms free-text search in every language.
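Because the wizard works on structured fields, the filtering is trivial compared to the search pipeline. A sketch, with illustrative field names:

```python
def filter_products(products, **criteria):
    """Keep products whose structured fields contain every selected value.
    No text search involved; field names here are illustrative."""
    results = products
    for field, wanted in criteria.items():
        results = [p for p in results if wanted in p.get(field, [])]
    return results

catalog = [
    {"code": "WP-3200", "applications": ["below-grade"], "substrates": ["concrete"]},
    {"code": "WP-1100", "applications": ["roofing"], "substrates": ["metal", "concrete"]},
]
matches = filter_products(catalog, applications="below-grade", substrates="concrete")
```

The dropdown values map directly onto field values, so there is nothing to tokenize, embed, or rank, and the behavior is identical in all three languages.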

Evaluation: how you know it works

The most overlooked part of building a RAG system is evaluation. How do you know your retrieval is actually returning relevant results? How do you measure whether the generated answers are accurate?

We built an evaluation suite with 248 automated tests covering three categories. Retrieval accuracy tests verify that known queries return the expected products. We maintain a test set of 50 question-answer pairs across all three languages, and we run them against the retrieval pipeline after every change. If retrieval recall drops below our threshold, the deploy fails.
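The retrieval gate can be as simple as a recall@k check over the test set, with the deploy failing below a threshold. A sketch, using a toy retriever as a stand-in for the real hybrid pipeline:

```python
def recall_at_k(test_cases, retrieve, k=5):
    """Fraction of test queries whose expected product appears in the top-k."""
    hits = 0
    for case in test_cases:
        if case["expected"] in retrieve(case["query"], k):
            hits += 1
    return hits / len(test_cases)

def fake_retrieve(query, k):
    """Toy retriever standing in for the real hybrid pipeline."""
    return ["WP-3200", "WP-1100"][:k]

cases = [
    {"query": "membrane for below-grade concrete", "expected": "WP-3200"},
    {"query": "roof coating", "expected": "WP-1100"},
    {"query": "joint sealant tape", "expected": "WP-8800"},
]
recall = recall_at_k(cases, fake_retrieve)
assert recall >= 0.6, "retrieval recall below threshold; fail the deploy"
```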

Answer quality tests verify that the generated responses contain correct information. We extract factual claims from generated answers and check them against the source product data. This catches hallucinations where the model invents specifications that do not exist.
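One cheap version of this check extracts numeric values from the generated answer and flags any that never appear in the source specs. It misses plenty of failure modes, but it reliably catches invented numbers. The product data here is illustrative:

```python
import re

def check_numeric_claims(answer, source_specs):
    """Return numbers in the answer that don't appear anywhere in the
    source product data -- a cheap hallucination check, not full fact-checking."""
    claimed = set(re.findall(r"\d+(?:\.\d+)?", answer))
    known = set(re.findall(r"\d+(?:\.\d+)?", " ".join(source_specs)))
    return sorted(claimed - known)

specs = ["WP-3200", "Thickness: 60 mil", "Meets ASTM D6083",
         "Coverage: 100 sq ft per roll"]
answer = "WP-3200 is a 60 mil membrane rated to ASTM D6083 with 250 sq ft coverage."
flagged = check_numeric_claims(answer, specs)  # → ['250']
```

The invented coverage figure is flagged because 250 appears nowhere in the source data, while the genuine spec values pass.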

Edge case tests cover the queries that break naive implementations. Empty queries, queries in mixed languages, queries with typos, queries about products we do not carry. Each of these has a defined expected behavior.

Practical advice for building your own

Start with your data, not your model. The quality of your source documents determines the ceiling of your RAG system. Clean, structured, complete data with a mediocre retrieval setup will outperform a state-of-the-art pipeline built on messy data.

Use hybrid search from day one. Pure vector search will look great in your demo and fail on the first real user query that contains a product code or specification number.

Chunk along semantic boundaries. If your data has natural units (products, articles, sections), use those as your chunk boundaries instead of arbitrary token counts.

Build an evaluation suite before you optimize. Without measurement, you are tuning parameters based on vibes.

And test with real users in their real language before you ship. The gap between how developers think users will search and how users actually search is the gap between a useful product and shelf-ware.

Have a project in mind?

We build custom AI-powered systems for businesses that need real solutions. If something here resonated, let's talk.