Build a RAG System locally with Ollama
In this blog post, I’m going to build a Retrieval-Augmented Generation (RAG) system locally with Ollama that works with any document.
Tech Stack
- Ollama (with llama3 as LLM)
- HuggingFace models for Embeddings
- ChromaDB as VectorDB (for storing embeddings)
- Gradio (for UI)
- LangChain (for chains and connectors)
RAG System
At its core, RAG is an LLM plus additional, retrieved information supplied as context alongside the query.
A RAG system integrates two components:
Retriever: This component searches a knowledge base to find relevant documents or passages based on a query.
Generator: This component uses the retrieved documents to generate a coherent and contextually appropriate response.
The idea is to do the following:
- Document Loading - Load the domain-specific documents
- Document Splitting - Split the documents into chunks
- Create Embeddings for the chunks
- Store the Embeddings to a VectorDB
- Retrieval - For a given query, do a similarity search using the MMR (Maximal Marginal Relevance) algorithm and retrieve the top-k most relevant chunks
- Write a specific prompt and query the LLM with the additional domain context
- Integrate this to a Gradio Chat UI
Design Diagram

Let us see them in detail.
1. Document Loading
LangChain is a popular open-source framework for building applications with LLMs.
In this example, we are going to use the famous “Attention Is All You Need” paper in PDF format. The paper introduces the Transformer model, which uses self-attention mechanisms instead of recurrent or convolutional layers to process sequential data. We are going to load this paper into our local vector store and ask questions about it.
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("document.pdf")
pages = loader.load()
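To sanity-check the load, we can look at how many pages were parsed and peek at the first one (a quick sketch, not required for the pipeline):

```python
# Quick check that the PDF was parsed: each entry in `pages` is a LangChain Document
print(len(pages))                    # number of pages in the PDF
print(pages[0].page_content[:200])   # first 200 characters of page 1
print(pages[0].metadata)             # source file name and page number
```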
2. Document Splitting
The next step is to use a text splitter from LangChain and split the document into smaller chunks.
In this case, chunk_size is set to 2000 and chunk_overlap to 150, so adjacent chunks share some overlapping text, which gives the similarity search better context.
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=2000,
    chunk_overlap=150,
    length_function=len
)
chunks = text_splitter.split_documents(pages)
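The import above also brings in RecursiveCharacterTextSplitter. As a side note, here is a sketch of the same split done with the recursive splitter, which falls back through a list of separators (paragraphs, then lines, then words) and often produces cleaner chunks for PDFs; the sizes simply mirror the ones used above:

```python
# Alternative splitter: tries "\n\n", then "\n", then " " before splitting mid-word
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=150,
    length_function=len
)
chunks = recursive_splitter.split_documents(pages)
print(len(chunks))  # how many chunks the paper was split into
```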
3. Create Embeddings for the chunks
Embeddings are a powerful concept. An embedding is a vector representation (a list of floating-point numbers) of a record, and the distance between two vectors measures how related the two records are.
Here, I’m going to use a HuggingFace sentence-transformers embedding model with LangChain.
As a prerequisite, we need to set the HUGGINGFACEHUB_API_TOKEN environment variable.
from langchain_community.embeddings import HuggingFaceEmbeddings
embedding = HuggingFaceEmbeddings()
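With no arguments, HuggingFaceEmbeddings falls back to a default sentence-transformers model. A quick way to confirm the embeddings work and to see the vector dimensionality (a small sketch, not required for the pipeline):

```python
# Embed a sample query and inspect the resulting vector
sample_vector = embedding.embed_query("What is self-attention?")
print(len(sample_vector))   # dimensionality of the embedding vector
print(sample_vector[:5])    # first few floating-point components
```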
4. Store the Embeddings to a VectorDB
The next step is to store the generated embeddings in a vector store database. There are several options currently - ChromaDB, Pinecone, etc. For simplicity, and to keep everything running on our local machine, I’m going to use ChromaDB.
from langchain.vectorstores import Chroma

persist_dir = 'docs/chroma_document_pdf'
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding,
    persist_directory=persist_dir
)
vectordb.persist()
The vector representations are now stored in our local ChromaDB.
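Because the store is persisted to disk, later runs can skip the loading and embedding steps and simply reopen it (a sketch, assuming the same persist_dir and embedding model as above):

```python
# Reopen the persisted ChromaDB collection without re-embedding the document
vectordb = Chroma(
    persist_directory=persist_dir,
    embedding_function=embedding
)
# Quick check that the collection answers queries
print(vectordb.similarity_search("Transformer", k=1))
```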
5. Retrieval
As mentioned earlier, retrieval is the core of the entire process. I’m going to use the MMR (Maximal Marginal Relevance) algorithm to retrieve the top-k chunks for a given question; MMR balances relevance to the query against diversity among the returned chunks.
For example, assume we want to ask what this paper talks about.
LangChain makes this easy: we can run the MMR search directly on the vector store.
question = "What does this document talk about?"
docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)
print(docs_mmr)
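The same MMR search can also be wrapped as a LangChain retriever, which is the form the QA chain in the next step expects (a small sketch using the standard as_retriever interface):

```python
# Wrap the vector store as a retriever that uses MMR and returns the top 3 chunks
retriever = vectordb.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3}
)
print(retriever.get_relevant_documents(question))
```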
6. Ask Questions - Send Additional Information as Context to LLM
So far, we have not discussed the LLM itself. I’m going to use llama3 from Meta.
We can install Ollama and run the model locally. Ollama is like Docker for AI models.
Steps below:
- Either run `brew install ollama` or download the installer from https://ollama.com/download
- Run `ollama run llama3` to pull llama3; that is all that is needed to run the model locally.
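Once the model is pulled, we can point LangChain at the local Ollama server to make sure it responds (a quick sketch using the community Ollama wrapper; the prompt text is just an example):

```python
from langchain_community.llms import Ollama

# llama3 served by the local Ollama instance (default: http://localhost:11434)
llm = Ollama(model="llama3")
print(llm.invoke("In one sentence, what is a Transformer?"))
```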
Set up the prompt as a template.
Prompt engineering is powerful: it is how we define personas/agents and constrain how the model answers, much like what is done with GPT-4 style models.
from langchain.prompts import PromptTemplate
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
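To see what the LLM will actually receive, we can render the template with dummy values (purely illustrative):

```python
# Render the prompt with placeholder values to inspect the final text sent to llama3
print(QA_CHAIN_PROMPT.format(
    context="<retrieved chunks would go here>",
    question="What does this document talk about?"
))
```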
Next, build a RetrievalQA chain that ties together the llama3 LLM (served by Ollama), the MMR retriever and the prompt template, and define a function, named here as chat_bishop, that uses this chain to answer the questions.
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# llama3 running locally behind the Ollama server
llm = Ollama(model="llama3")

# RetrievalQA chain: MMR retrieval over ChromaDB + the prompt template defined above
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3}),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

def chat_bishop(question, history):
    # Run the RAG chain for the incoming question; `history` is supplied by Gradio
    result = qa_chain({"query": question})

    # Pretty-print the long answer string for readability in the console
    import pprint
    pp = pprint.PrettyPrinter(width=50)
    pp.pprint(result["result"])

    return result["result"]
The above code uses the llama3 model, running locally via Ollama, as the LLM.
For a given question, we do a retrieval with maximal marginal relevance and fetch the top 3 chunks. The LLM is then called with those chunks as additional context alongside the search query.
The generated response is then shown to the user. We could also use BERTScore to validate the response, but that is not done as part of this RAG pipeline.
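For completeness, a rough sketch of what such a check could look like with the bert-score package, assuming we had a reference answer to compare against (the strings here are made up):

```python
from bert_score import score

# Compare the generated answer against a hand-written reference answer
candidate = ["The paper introduces the Transformer architecture."]
reference = ["The paper proposes the Transformer, a model based on self-attention."]

P, R, F1 = score(candidate, reference, lang="en")
print(F1.mean().item())  # closer to 1.0 means the answer is semantically closer to the reference
```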
There is also scope to use LangChain chains to store the Q&A history and feed the chat history back into the conversation.
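A minimal sketch of that idea, using LangChain's ConversationalRetrievalChain with a buffer memory (not wired into the Gradio app below; shown only to illustrate the direction):

```python
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Keep the running chat history in memory and pass it to the chain on every turn
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

conv_chain = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3}),
    memory=memory
)
print(conv_chain({"question": "What does this document talk about?"})["answer"])
```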
7. Gradio for UI
Gradio is an amazing tool for quick POCs, and the documentation is pretty good. The code below generates a chatbot and runs it on localhost on a local port.
import gradio as gr

gradio_interface = gr.ChatInterface(
    chat_bishop,
    chatbot=gr.Chatbot(),
    textbox=gr.Textbox(placeholder="Example: What is your Question?",
                       container=False, scale=7),
    title="Hey, welcome. Please ask your Question",
    description="Ask the chatbot a question!",
    theme='gradio/base',
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
)
gradio_interface.launch()
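By default launch() serves the app on a free local port. If you want a fixed port or a temporary public link, Gradio's standard launch options can be passed in (an optional variation, not needed for the local demo):

```python
# Optional: pin the port and expose a temporary public share link
gradio_interface.launch(server_name="0.0.0.0", server_port=7860, share=True)
```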
The Gradio UI will look as below. Sample questions asked are shown.

This is a simple example that demonstrates these powerful technologies - LangChain, Ollama and RAG pipeline concepts - and shows how similar systems can be built with additional features.
The project rag-example is on GitHub.