RAG, AI, embeddings, pgvector

RAG for companies: how we built an intelligent search engine over 124,000 documents

By Daniel Petrasic · 12 min read

The problem: 124,000 documents that nobody can search

Chile has an enormous legal corpus: 7 base codes, more than 14 special laws, thousands of rulings from the Labor Directorate, IRS circulars, Constitutional Court decisions, Supreme Court rulings, and opinions from the Comptroller General. In total, more than 124,000 legal documents that affect the daily lives of millions of people.

The problem: this information is scattered across dozens of institutional websites, in different formats (PDF, HTML, XML), without a unified search engine that understands natural language. If you want to know “can I work on Sundays?”, you need to know that the answer lies in Article 38 of the Labor Code, in the Labor Directorate rulings that interpret it, and in Law 21,561 that modified the workweek.

We built a system that answers that question with verifiable citations in under 3 seconds. Here’s how it works.


What is RAG?

Retrieval-Augmented Generation (RAG) is an architecture pattern that combines semantic search with language generation. Instead of asking an AI model to “know” the answer (which leads to hallucinations), you provide the relevant documents as context and ask it to respond based solely on them.

The flow is simple in concept, complex in execution:

  1. The user asks a question in natural language.
  2. The question is converted into a numerical vector (embedding) that captures its semantic meaning.
  3. The most similar documents are retrieved from a vector database (the 5-10 most relevant out of 124,000).
  4. The retrieved documents are sent as context to a language model, along with strict instructions: “respond only with what is in these documents, cite the source”.
  5. The model generates a response with verifiable citations that the user can check against the original source.
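
The steps above can be sketched in a few lines. This is a toy illustration, not the production implementation: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words), the documents are placeholders, and the function names are invented for this example. The final call to the language model (step 5) is left out.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, documents: dict[str, str], k: int = 5) -> list[str]:
    """Steps 2-3: embed the question, return the ids of the k most similar documents."""
    q = toy_embed(question)
    ranked = sorted(
        documents,
        key=lambda doc_id: cosine_similarity(q, toy_embed(documents[doc_id])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str, documents: dict[str, str], doc_ids: list[str]) -> str:
    """Step 4: pack the retrieved documents and the strict instructions into one prompt."""
    context = "\n\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return (
        "Respond only with what is in these documents and cite the source id in brackets.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )
```

A real system replaces `toy_embed` with calls to an embedding API and the in-memory `sorted` with a vector database query, but the shape of the flow stays the same.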

Real architecture: what we use

This is not a theoretical architecture. It’s what runs in production today, answering legal queries from real users:

Embeddings: Voyage AI (voyage-3.5, 1024 dimensions). We evaluated OpenAI, Cohere, and Voyage AI. Voyage AI won in search quality for Spanish text and in cost: 50 million free tokens per month. For our current volume, the embedding cost is $0.

Vector database: PostgreSQL + pgvector. We don’t use Pinecone, Weaviate, or Qdrant. We use PostgreSQL with the pgvector extension, hosted on Supabase. Why? Because we already had PostgreSQL for the rest of the application, and pgvector supports cosine similarity search with HNSW/IVFFlat indexes. Less infrastructure, lower costs, fewer points of failure.
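
For readers who haven’t used pgvector, here is a rough sketch of what a cosine-similarity lookup can look like. The `<=>` operator is pgvector’s cosine distance; everything else is an assumption made for illustration: the `chunks` table, its `id`, `content`, and `embedding` columns, and the `%s` placeholder style (as used by drivers such as psycopg). The function only builds the query and its parameters; it does not connect to a database.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. [0.1,0.2,0.3]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

def top_k_query(embedding: list[float], k: int = 5) -> tuple[str, tuple]:
    """Build a parameterized top-k query ordered by cosine distance.

    Table and column names are hypothetical.
    """
    sql = (
        "SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity "
        "FROM chunks "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    literal = to_vector_literal(embedding)
    return sql, (literal, literal, k)
```

Since `<=>` returns a distance, `1 - distance` gives the cosine similarity, and ordering by the raw distance lets PostgreSQL use the vector index.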

LLM: Groq with Llama 3.3 70B. We chose Groq for speed (tokens per second) and cost (generous free tier). Llama 3.3 70B responds well in Spanish and follows citation instructions with precision. The inference cost for our volume: $0.

Frontend: Next.js with streaming. The response is sent token by token to the user using Server-Sent Events. The user sees the response being built in real time, which improves the perception of speed.


The real challenge: data quality

The architecture is the easy part. The real work is in the data. Each legal source has its own complexities:

Codes and laws come from the National Library of Congress (BCN) API in XML format. The structure is hierarchical (books → titles → paragraphs → articles). Each article is vectorized individually, preserving its hierarchical context in the embedding.
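
One simple way to preserve hierarchical context is to prefix each article’s text with its position in the code before embedding it. The sketch below assumes that approach; the `Article` fields, the breadcrumb format, and the sample values are hypothetical and do not reflect the BCN XML schema.

```python
from dataclasses import dataclass

@dataclass
class Article:
    code: str        # e.g. "Labor Code"
    book: str        # may be empty when a level does not apply
    title: str
    paragraph: str
    number: str
    text: str

def embedding_input(article: Article) -> str:
    """Prefix the article text with a breadcrumb of its hierarchy, skipping empty levels."""
    levels = (
        article.code,
        article.book,
        article.title,
        article.paragraph,
        f"Article {article.number}",
    )
    breadcrumb = " > ".join(level for level in levels if level)
    return f"{breadcrumb}\n{article.text}"
```

With this input, an article that reads ambiguously on its own still carries the name of the code and section it belongs to when it is embedded.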

Labor Directorate rulings don’t have an API. We obtained them by scraping the Labor Directorate website and extracting the legal summary of each ruling (the original PDFs are scanned images without a text layer).

IRS circulars are in PDFs with extractable text. Each circular was downloaded, the text was extracted, and it was split into chunks of ~5,000 characters with 200-character overlap to avoid losing context at the boundaries.
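
A minimal sketch of fixed-size chunking with overlap, using the sizes mentioned above as defaults. The function name is invented for this example, and it cuts at raw character offsets; a production splitter would likely prefer sentence or paragraph boundaries.

```python
def chunk_text(text: str, size: int = 5000, overlap: int = 200) -> list[str]:
    """Split text into chunks of up to `size` characters.

    Each chunk starts `overlap` characters before the previous one ends,
    so text near a boundary appears in both neighbors.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

For a 12,000-character text this yields three chunks starting at offsets 0, 4,800, and 9,600, with the last 200 characters of each chunk repeated at the start of the next.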

Supreme Court decisions were the biggest challenge: ~30,000 decisions obtained from the Judiciary, each with full text that can exceed 20,000 characters. We used section-aware chunking that preserves the FINDINGS, CONSIDERATIONS, and RESOLUTION sections.
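
Section-aware chunking can be sketched as a split on section headings before any size-based splitting. The example below assumes each section opens with its heading in capitals at the start of a line, optionally followed by a colon; the real decisions may mark sections differently, and the function name is hypothetical.

```python
import re

# Assumed heading labels, taken from the section names mentioned in the text.
SECTION_HEADINGS = ("FINDINGS", "CONSIDERATIONS", "RESOLUTION")

def split_by_sections(decision: str) -> dict[str, str]:
    """Return the body of each section found, keyed by its heading.

    If a heading appears more than once, the last occurrence wins.
    """
    pattern = re.compile(r"^(%s)\b:?" % "|".join(SECTION_HEADINGS), re.MULTILINE)
    matches = list(pattern.finditer(decision))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(decision)
        sections[match.group(1)] = decision[match.end():end].strip()
    return sections
```

Each section can then be chunked by size on its own and tagged with its heading, so that a chunk never mixes the reasoning of a decision with its outcome.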


Real costs: how much it costs to operate this

One of the most frequent questions we receive is about costs. Here is the reality:

  • Embeddings (Voyage AI): $0/month. The free tier covers our current volume. If we grew 10x, it would be ~$3 USD/month.
  • LLM (Groq): $0/month. 6,000 requests/day on the free tier. With our current volume, we don’t come close to the limit.
  • Database (Supabase Pro): $25 USD/month. 8 GB of disk for 124,000 documents with 1024-dimension vectors.
  • Hosting (Vercel): $0/month. The hobby tier handles our current traffic without issues.
  • Total: ~$25 USD/month to operate a semantic search engine over 124,000 legal documents.

This isn’t magic or a subsidy: it’s the result of choosing services with generous free tiers and optimizing the architecture to minimize external API calls (embedding cache, intelligent rate limiting).
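
An embedding cache can be as simple as a dictionary keyed by a hash of the normalized query. The sketch below is an in-memory illustration under that assumption; the class name, the normalization rule, and the `embed_fn` callable (a stand-in for the real embedding API client) are all hypothetical.

```python
import hashlib
from typing import Callable

class EmbeddingCache:
    """Cache embeddings so repeated queries do not trigger another API call."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn          # stand-in for the real embedding client
        self._store: dict[str, list[float]] = {}
        self.api_calls = 0                 # counts cache misses, for monitoring

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivially different queries share an entry.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> list[float]:
        key = self._key(text)
        if key not in self._store:
            self.api_calls += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

In a deployment with several server instances, the dictionary would need to be replaced by a shared store such as a database table, since each process would otherwise keep its own cache.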


Lessons learned

After a year building and operating this system, these are the lessons we wish we had known beforehand:

Chunking quality matters more than the model. You can use the best LLM in the world, but if your chunks are poorly split (an article cut in half, a ruling without its conclusion), the responses will be mediocre. We invested more time in the chunking strategy than in any other component.

Vector indexes have memory limits. With 82,000 documents, an HNSW index worked fine with 64 MB of RAM. At 124,000, we had to switch to IVFFlat because the shared server didn’t have enough memory. Plan for your index growth from day 1.
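
As a rough guide for sizing an IVFFlat index, the pgvector documentation suggests starting with `lists = rows / 1000` for up to one million rows and `sqrt(rows)` above that. The sketch below applies that rule and builds the corresponding statement; the helper names and the table and column names passed to them are hypothetical, and the row count is the number of stored vectors, which can be higher than the number of documents once long texts are chunked.

```python
import math

def ivfflat_lists(row_count: int) -> int:
    """Starting value for `lists`, following the rule of thumb in the pgvector docs."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))

def ivfflat_index_ddl(table: str, column: str, row_count: int) -> str:
    """Build a CREATE INDEX statement for cosine distance (names are hypothetical)."""
    return (
        f"CREATE INDEX ON {table} USING ivfflat ({column} vector_cosine_ops) "
        f"WITH (lists = {ivfflat_lists(row_count)});"
    )
```

IVFFlat generally builds faster and uses less memory than HNSW, at the price of a weaker speed-recall tradeoff, so it is worth measuring recall after the switch.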

Rate limiting from the start. AI APIs have strict limits. Without global rate limiting, a single user can exhaust everyone’s quota. Implement rate limiting per user AND globally before your first real user.
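
A minimal sketch of a limiter that enforces both a per-user and a global quota over a sliding time window. It keeps its state in memory and assumes a single process; the class name, the limits, and the window length are illustrative, and a multi-instance deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class RateLimiter:
    """Sliding-window limiter with a per-user and a global request quota."""

    def __init__(self, per_user: int, global_limit: int, window_seconds: float = 60.0):
        self.per_user = per_user
        self.global_limit = global_limit
        self.window = window_seconds
        self._user_hits = defaultdict(deque)   # user id -> timestamps of accepted requests
        self._global_hits = deque()            # timestamps of all accepted requests

    @staticmethod
    def _prune(hits: deque, cutoff: float) -> None:
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= cutoff:
            hits.popleft()

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        """Return True and record the request if both quotas have room."""
        now = time.monotonic() if now is None else now
        cutoff = now - self.window
        user_hits = self._user_hits[user_id]
        self._prune(user_hits, cutoff)
        self._prune(self._global_hits, cutoff)
        if len(user_hits) >= self.per_user or len(self._global_hits) >= self.global_limit:
            return False
        user_hits.append(now)
        self._global_hits.append(now)
        return True
```

Rejected requests are not recorded, so a user who keeps retrying while blocked does not push their own window further out.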

Legal data changes. Laws are amended, rulings are issued daily, circulars are updated. You need an automatic update pipeline, not a one-time load. We have cron jobs that detect new laws from the BCN API every day.


RAG for your company?

The RAG pattern isn’t exclusive to the legal field. It works for any organization with a large document corpus that needs to be searchable: technical manuals, internal policies, contracts, financial reports, product documentation.

The difference between “a chatbot that hallucinates” and “an assistant that cites its sources” is a well-implemented RAG. If you have documents that your teams can’t search efficiently, we can help.
