RAG with Spring AI: Make the Model Answer From Your Own Data

Aman Sahni
2026 · 8 min read

In the first post we got a ChatClient talking to a model in a few lines. But that model only knows what it was trained on. It has never seen your product docs, your internal wiki, or last week's incident reports. Ask it about any of that and it will either say "I don't know" or — worse — confidently make something up.

RAG fixes this. And in Spring AI, it's not a new framework to learn. It's one advisor you attach to the same ChatClient you already built.

The one idea — RAG = "look it up, then answer." You retrieve the relevant chunks of your data, paste them into the prompt as context, and let the model answer from that. Spring AI's QuestionAnswerAdvisor does the retrieve-and-paste step for you.

This post builds directly on the first one. Same Java 21+, same Spring AI 1.1.x. By the end you'll have an endpoint that answers questions using documents you loaded — and you'll understand the three moving parts well enough to swap any of them.

Why the model needs your data pasted in

An LLM is frozen at training time. It has no live access to your database and no memory of your business. There are two main ways to give it knowledge it doesn't have: fine-tune it (expensive, slow, and stale the moment your data changes), or hand it the relevant facts at question time inside the prompt.

RAG is the second option, done well. The trick is the "relevant" part — you can't paste your entire wiki into every prompt; it won't fit, and it would cost a fortune in tokens. So you store your data as embeddings (vectors that capture meaning), and at question time you pull back only the handful of chunks most similar to what the user asked. That's a similarity search, and it's the whole game.
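
To make "similarity search" concrete, here is a minimal plain-Java sketch that does not use Spring AI at all. The three-number vectors are hand-made stand-ins for real embeddings (which come from an EmbeddingModel and have far more dimensions), and the class and method names are hypothetical. It scores each stored chunk against the query vector with cosine similarity and returns the closest one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy similarity search: hypothetical names, hand-made vectors instead of real embeddings.
final class SimilaritySketch {

    // Cosine similarity: dot product divided by the product of the two vector lengths.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the text of the stored chunk whose vector is closest to the query vector.
    static String closest(Map<String, double[]> store, double[] query) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : store.entrySet()) {
            double score = cosine(entry.getValue(), query);
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, double[]> store = new LinkedHashMap<>();
        store.put("Our refund window is 14 days from purchase.", new double[] {0.9, 0.1, 0.0});
        store.put("Enterprise plans include 24/7 priority support.", new double[] {0.1, 0.9, 0.1});
        // Pretend this is the embedding of "How long do I have to get a refund?"
        double[] query = {0.8, 0.2, 0.1};
        System.out.println(closest(store, query));
    }
}
```

A real vector store does the same comparison, just over many more vectors and usually with an index so it doesn't have to score every chunk.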

The three moving parts

Every RAG setup in Spring AI is the same three pieces. Learn these names and the rest is wiring:

EmbeddingModel — turns text into vectors. Auto-configured by your starter, just like the chat model.
VectorStore — stores those vectors and runs similarity search. Can be an in-memory store for demos or a real database (PGVector, Redis, Pinecone) in production.
QuestionAnswerAdvisor — the glue. It intercepts the prompt, searches the VectorStore, and appends the results as context before the model sees it.

Notice what's not on that list: any change to your controller's mental model. You're still calling chatClient.prompt().user(...).call(). You're just adding an advisor.
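
What the advisor does on each call can be sketched in plain Java. This is a conceptual stand-in, not Spring AI's implementation: the Retriever interface, the class name, and the prompt wording are all assumptions made for illustration.

```java
import java.util.List;

// Conceptual sketch of "retrieve, then paste" (hypothetical names, not Spring AI's code).
final class RetrieveThenAnswerSketch {

    // Stand-in for a vector store lookup: question in, relevant chunks out.
    interface Retriever {
        List<String> retrieve(String question);
    }

    // Builds the text the model would actually see: the question with retrieved context appended.
    static String augment(String question, Retriever retriever) {
        List<String> chunks = retriever.retrieve(question);
        String context = String.join("\n", chunks);
        return question + "\n\n"
                + "Context:\n" + context + "\n\n"
                + "Answer from the context only. If the answer is not there, say you don't know.";
    }

    public static void main(String[] args) {
        // A fake retriever that always returns the refund policy chunk.
        Retriever fake = q -> List.of("Our refund window is 14 days from purchase.");
        System.out.println(augment("How long do I have to get a refund?", fake));
    }
}
```

The controller never sees any of this. It sends the plain question, and the augmented text is what reaches the model.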

Step 1: Add the vector-store advisor dependency

The advisor lives in its own module. Add it alongside the OpenAI starter from the first post:

<dependency>
  <groupId>org.springframework.ai</groupId>
  <artifactId>spring-ai-advisors-vector-store</artifactId>
</dependency>

For Gradle:

implementation "org.springframework.ai:spring-ai-advisors-vector-store"

The BOM you already imported manages the version, so there's no version number here. That's the whole point of the BOM.

Step 2: A vector store you don't have to install

You don't need to stand up a database to learn RAG. Spring AI ships SimpleVectorStore — an in-memory store that's perfect for a first pass. Declare it as a bean and hand it the auto-configured EmbeddingModel:

@Bean
public VectorStore vectorStore(EmbeddingModel embeddingModel) {
    return SimpleVectorStore.builder(embeddingModel).build();
}

Dev vs prod — SimpleVectorStore keeps everything in memory and disappears on restart. That's fine for learning and demos. For production you swap this one bean for PGVector, Redis, or another store — and because everything downstream talks to the VectorStore interface, nothing else in your code changes. Same portability lesson as swapping chat providers in the last post.

Step 3: Load some documents

Before the model can answer from your data, the data has to be in the store. In real life you'd read PDFs or pull from a database; here we'll add a few documents by hand so the flow is clear. Spring AI embeds and stores them for you:

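import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class KnowledgeController {

    private final VectorStore vectorStore;

    // Spring injects the VectorStore bean declared in Step 2.
    public KnowledgeController(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @PostMapping("/load")
    public String load() {
        var docs = List.of(
            new Document("Our refund window is 14 days from purchase."),
            new Document("Enterprise plans include 24/7 priority support."),
            new Document("API rate limit on the free tier is 60 requests per minute.")
        );
        // Embeds each document and stores the resulting vectors.
        vectorStore.add(docs);
        return "Loaded " + docs.size() + " documents";
    }
}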

Call POST /load once. Behind that single vectorStore.add(...) call, each document is run through the embedding model and stored as a vector. You never touch the math.
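
The three sentences above are short enough to embed as they are. Longer sources (a PDF, a wiki page) are usually cut into chunks first so that each vector covers one focused passage. Spring AI has splitters for this (TokenTextSplitter, for example); the sketch below is not that API. It is a minimal plain-Java illustration of fixed-size chunking with overlap, with a hypothetical class name and a character-based window chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Naive fixed-size chunker with overlap (hypothetical helper, character-based for simplicity).
final class ChunkingSketch {

    // Splits text into windows of at most `size` characters;
    // consecutive windows share `overlap` characters.
    static List<String> chunk(String text, int size, int overlap) {
        if (size <= 0 || overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("need size > 0 and 0 <= overlap < size");
        }
        List<String> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each chunk would become its own Document before being added to the store.
        chunk("abcdefghij", 4, 1).forEach(System.out::println);
    }
}
```

The overlap is there so a sentence that straddles a boundary still appears whole in at least one chunk. Real splitters usually count tokens rather than characters and try to break on sentence or paragraph boundaries.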

Step 4: Wire the advisor and ask a question

Now the payoff. Attach a QuestionAnswerAdvisor to the ChatClient and ask something only your documents know:

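import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.vectorstore.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class RagController {

    private final ChatClient chatClient;

    public RagController(ChatClient.Builder builder, VectorStore vectorStore) {
        // The advisor runs on every call made through this ChatClient.
        this.chatClient = builder
                .defaultAdvisors(QuestionAnswerAdvisor.builder(vectorStore).build())
                .build();
    }

    @GetMapping("/ask")
    public String ask(@RequestParam String question) {
        // Same call shape as the first post; retrieval happens inside the advisor.
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}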

Hit /ask?question=How long do I have to get a refund? and the model answers "14 days" — not because it was trained on your policy, but because the advisor found that document, pasted it into the prompt, and the model read it. Ask something your documents don't cover and a well-configured RAG setup will tell you it doesn't know, instead of inventing an answer.

Read the controller again. It's the exact shape from the first post. The only new line is .defaultAdvisors(...). That's RAG.

Tuning what gets retrieved

The default advisor searches every document and pulls the closest matches. In a real corpus you'll want more control — how many chunks, and how similar they must be to count. The builder takes a SearchRequest for exactly this:

QuestionAnswerAdvisor.builder(vectorStore)
    .searchRequest(SearchRequest.builder()
            .topK(4)
            .similarityThreshold(0.75)
            .build())
    .build();

topK caps how many chunks get pasted in (more context costs more tokens and can dilute the answer). similarityThreshold filters out weak matches so junk doesn't end up in your prompt. These two knobs fix most "the answers are vague" complaints.
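
The effect of the two knobs can be sketched in plain Java. This is an illustration under stated assumptions, not how any particular vector store implements it: the Scored record, the select method, and the scores themselves are made up for the example.

```java
import java.util.List;

// What topK and similarityThreshold do to a scored result list (hypothetical names and scores).
final class RetrievalTuningSketch {

    // A retrieved chunk together with its similarity score (higher means more similar).
    record Scored(String text, double score) {}

    // Drops matches below the threshold, then keeps the topK best of what is left.
    static List<Scored> select(List<Scored> candidates, int topK, double threshold) {
        return candidates.stream()
                .filter(c -> c.score() >= threshold)
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .limit(topK)
                .toList();
    }

    public static void main(String[] args) {
        List<Scored> candidates = List.of(
                new Scored("Our refund window is 14 days from purchase.", 0.91),
                new Scored("Enterprise plans include 24/7 priority support.", 0.42),
                new Scored("API rate limit on the free tier is 60 requests per minute.", 0.38));
        // With threshold 0.75 only the refund chunk survives, even though topK allows 4.
        select(candidates, 4, 0.75).forEach(c -> System.out.println(c.score() + " " + c.text()));
    }
}
```

Note that the threshold can leave you with fewer than topK chunks, or none at all. An empty result is the case where you want the model to say it doesn't know.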

If RAG answers feel off, it's almost always retrieval, not the model. Check what chunks came back before you blame the LLM.

When to graduate to the production pipeline

QuestionAnswerAdvisor is the zero-config, single-store path, ideal for getting RAG working. It can already handle simple metadata filtering through a filter expression on the SearchRequest. When your pipeline needs more than that, such as joining results from multiple stores or rewriting the query before retrieval, Spring AI offers RetrievalAugmentationAdvisor (in the spring-ai-rag module), a composable pipeline you assemble from modular pieces.

Don't reach for it on day one. Start with QuestionAnswerAdvisor, ship something that works, and move up only when a real requirement pushes you there. That's the same discipline you'd apply to any abstraction in Spring.

Where this fits in the series

This is post two in our Spring AI track, and it builds on the same ChatClient foundation:

Done: Spring AI basics, your first ChatClient call.
You are here: RAG, answering from your own data.
Next: Agents, different ways to create AI agents in Java.
Then: Tools & MCP, letting the model call your code (query a database, hit an API, take actions).
After that: Guardrails & LLM security, what to lock down before any of this ships.

RAG is the feature that turns "a chatbot that sounds smart" into "a system that knows your business." It's also the one most teams want first. Get this pattern solid and you've covered the majority of real-world AI backend work.
