In my previous post I walked through a RAG example but glossed over the details. In this post I’ll back up and provide the details.
The key steps in RAG are
- Load the data
- Split the text into smaller chunks that fit within context limits
- Wrap each chunk in a Document object
- Embed each chunk as a vector that captures its semantic meaning
- Store the embedded chunks—typically in a vector store, a database designed to hold embeddings and provide fast semantic retrieval
- Invoke a retriever that queries the store and returns the most relevant Document objects
- Create a prompt for the LLM
Let’s walk through the steps shown in the previous post with these in mind.
Loading the document
First, we need to identify and load the documents. In our case, this consists only of a single text file with an excerpt from Romeo and Juliet. In most real-world scenarios you’ll have multiple data sources.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("RomeoAndJuliet.txt", encoding="utf-8")
docs = loader.load()
Notice that we are using the langchain_community document loader to load the text. LangChain is the principal framework we’ll be working with, and it can load many types of data.
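To make the result of loading concrete, here is a minimal stand-in for what loader.load() hands back: a list of Document objects, each pairing the text with metadata. (This dataclass is a simplified sketch, not LangChain’s actual Document class, and the excerpt text here is just illustrative.)

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Simplified stand-in for LangChain's Document object."""
    page_content: str                             # the raw text
    metadata: dict = field(default_factory=dict)  # e.g. where it came from

# Conceptually, loader.load() returns a list like this:
docs = [
    Document(
        page_content="Two households, both alike in dignity...",
        metadata={"source": "RomeoAndJuliet.txt"},
    )
]
```

The metadata travels with the text through splitting, embedding, and retrieval, which is how you can later tell which file an answer came from.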
Splitting the text
We saw how to chunk that data in the previous post. We begin by using a text splitter to break large text into overlapping chunks using token-based splitting (not character-based). In our case, we set each chunk to about 1,000 tokens with 200 tokens of overlap. The overlap helps ensure that meaning spanning a chunk boundary isn’t lost.
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
encoding_name='cl100k_base',
chunk_size=1000,
chunk_overlap=200
)
chunks = text_splitter.split_documents(docs)
The cl100k_base encoding is the tokenizer used by OpenAI models, including many of its embedding models, so counting chunks in tokens matches what the model actually sees. The 200-token overlap prevents losing meaning at the boundaries of the chunks and helps the embeddings preserve context.
We use a recursive text splitter because it splits text intelligently, splitting by paragraphs when possible, then by sentences if the paragraphs are too big, then by words and finally by characters.
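The overlap mechanism is easy to see with a toy sliding-window splitter. This is a deliberately simplified sketch—the real RecursiveCharacterTextSplitter prefers paragraph, then sentence, then word boundaries rather than cutting at fixed positions—but it shows how each chunk repeats the tail of the previous one:

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Toy illustration of overlapping chunks. The real splitter is
    smarter: it tries to cut at paragraph/sentence boundaries first."""
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = list(range(25))   # integers standing in for real tokens
chunks = split_with_overlap(tokens, chunk_size=10, chunk_overlap=2)
# Each chunk begins with the last 2 tokens of the previous chunk.
```

With 1,000-token chunks and a 200-token overlap, a sentence that straddles a cut point appears in full in at least one chunk.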
Embedding in a vector store
from langchain_openai import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
The embedding_model knows how to send text to OpenAI and get back a vector embedding (a list of numbers). Each chunk you pass into Chroma (see below) will be embedded using this model.
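Conceptually, an embedding maps text to a fixed-length vector—text-embedding-ada-002, for instance, returns 1,536 numbers per input. A toy stand-in makes the idea concrete (a real embedding captures meaning, not just word counts over a hand-picked vocabulary):

```python
import math

VOCAB = ["love", "hate", "night", "day"]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: maps text to a
    fixed-length vector (here: word counts over a tiny vocabulary)."""
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def cosine(a, b):
    """Cosine similarity: the standard way vector stores compare embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

v1 = toy_embed("love love night")
v2 = toy_embed("love night")
v3 = toy_embed("hate day")
# v1 is far closer to v2 than to v3, because they share vocabulary.
```

Real embeddings place semantically similar texts close together even when they share no words—that is the property similarity search relies on.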
Our next task is to build the vector store, using the chunks we created above:
from langchain_community.vectorstores import Chroma

vectorstore = Chroma.from_documents(
chunks,
embedding_model,
collection_name="RomeoAndJuliet"
)
Here we embed each of the chunks. Under the hood, Chroma calls embedding_model.embed_documents, which produces the vectors.
For each chunk, Chroma stores the vector embedding, the original text, and metadata such as the source file. This is what similarity search runs against (see below).
The final value passed in is the collection_name. The vector store is saved under that name.
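A useful mental model is one record per chunk, grouped under the collection name. This sketch is illustrative only—Chroma’s actual internal schema differs—but it captures what travels together:

```python
# Sketch of a vector-store record; Chroma's real schema differs.
record = {
    "id": "chunk-0001",
    "embedding": [0.012, -0.034, 0.056],  # truncated for display;
                                          # real ada-002 vectors have 1536 dims
    "document": "Two households, both alike in dignity...",
    "metadata": {"source": "RomeoAndJuliet.txt"},
}

# The collection_name keys the set of records:
collection = {"RomeoAndJuliet": [record]}
```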
Getting the retriever
As noted in the previous post, the next step is to create the retriever, which we do from the vector store, telling it that we want the search_type to be similarity and telling it how many of the most relevant chunks to return.
Querying the retriever returns a list of LangChain Document objects, each containing a text chunk and its metadata.
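Under the hood, similarity retrieval amounts to embedding the query, ranking every stored chunk by cosine similarity, and returning the top k. A pure-Python sketch with toy 2-D vectors (the chunk texts and vectors here are made up for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Stored (embedding, text) pairs -- toy 2-D vectors standing in for
# real 1536-dimensional embeddings.
store = [
    ([1.0, 0.0], "chunk about the feud"),
    ([0.9, 0.1], "chunk about the balcony scene"),
    ([0.0, 1.0], "chunk about the apothecary"),
]

def retrieve(query_vec, k=2):
    """Rank stored chunks by cosine similarity; return the top k texts."""
    ranked = sorted(store, key=lambda rec: cosine(query_vec, rec[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

top = retrieve([1.0, 0.05], k=2)
```

The k you pass to the real retriever plays exactly this role: how many of the best-matching chunks get stuffed into the prompt.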
Instantiating the LLM
The next section in the previous post is self-explanatory until we instantiate the LLM.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
model="gpt-4o-mini",
temperature=0,
max_tokens=10000,
top_p=0.95,
frequency_penalty=1.2,
stop_sequences=['INST']
)
Here we are using OpenAI’s gpt-4o-mini LLM – a popular and inexpensive model for RAG.
We set the temperature, a value that determines randomness in the answer. A temperature of 0 makes the output as deterministic and repeatable as possible.
max_tokens sets the upper bound on how long the model’s response can be.
top_p=0.95 is trickier. It tells the model to sample only from the smallest set of tokens whose cumulative probability reaches 95% (nucleus sampling). With temperature set to 0 this has no effect; if you tinker with temperature, however, it becomes useful.
frequency_penalty controls how strongly repeated tokens are discouraged. We’re using 1.2, a fairly strong penalty (the allowed range is -2.0 to 2.0) that encourages concise, non-repetitive answers.
stop_sequences says to stop generation when the model outputs INST. This just prevents the model from “leaking” into the next instruction.
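OpenAI’s documentation describes frequency_penalty as an adjustment applied to the logits before sampling: each token’s score is reduced in proportion to how often it has already appeared. A sketch of just that adjustment (the token scores and counts here are made up):

```python
def apply_frequency_penalty(logits, counts, penalty=1.2):
    """Sketch of frequency_penalty: each candidate token's logit is
    reduced in proportion to how many times it has already appeared."""
    return {tok: logit - penalty * counts.get(tok, 0)
            for tok, logit in logits.items()}

logits = {"the": 2.0, "moon": 1.5, "sun": 1.4}
counts = {"the": 3}   # "the" has already appeared 3 times
adjusted = apply_frequency_penalty(logits, counts)
# "the" drops from 2.0 to about -1.6, making a fourth repeat unlikely.
```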
That’s it! Together with the previous post, you are now fully equipped to implement your RAG. Enjoy!