Vectorstore Retrieval

LangChain Tutorial 4

Jan Kirenz

Learn advanced techniques for indexing and accessing data in a vector store so that you can retrieve the most relevant information, going beyond plain semantic similarity search.

Setup

Python

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.retrievers import TFIDFRetriever
from langchain.retrievers import SVMRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain.retrievers import ContextualCompressionRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from dotenv import load_dotenv, find_dotenv
import os
import openai
# import sys
# sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

Vector Database

Setup

Let's load the vector database we persisted in Tutorial 3.

persist_directory = '../docs/chroma/'

embedding = OpenAIEmbeddings()

vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

print(vectordb._collection.count())

Example

texts = [
    """The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).""",
    """A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.""",
    """A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.""",
]

smalldb = Chroma.from_texts(texts, embedding=embedding)

question = "Tell me about all-white mushrooms with large fruiting bodies"

Result 1

smalldb.similarity_search(question, k=2)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}), Document(page_content='The Amanita phalloides has a large and imposing epigeous (aboveground) fruiting body (basidiocarp).', metadata={})]

Result 2

smalldb.max_marginal_relevance_search(question, k=2, fetch_k=3)

[Document(page_content='A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all-white.', metadata={}), Document(page_content='A. phalloides, a.k.a Death Cap, is one of the most poisonous of all known mushrooms.', metadata={})]

Addressing Diversity

Basics

Addressing Diversity: Maximum marginal relevance (MMR)
In Tutorial 3 we introduced one problem: how to enforce diversity in the search results.
Maximum marginal relevance strives to achieve both relevance to the query and diversity among the results.
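
To make the trade-off concrete, here is a minimal sketch of greedy MMR selection in plain Python. It is not LangChain's implementation: the 2-D vectors are made-up stand-ins for embeddings, and the names `cosine`, `mmr_select`, and `lambda_mult` are chosen for illustration. Each step picks the candidate with the best balance of relevance to the query and dissimilarity to what has already been selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen greedily by MMR.

    lambda_mult=1.0 ranks by relevance only; lower values favour diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings" (assumption): documents 0 and 1 are near-duplicates, document 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
with_diversity = mmr_select(query, docs, k=2, lambda_mult=0.3)
print(relevance_only)  # [0, 1]
print(with_diversity)  # [0, 2]
```

With relevance only, both near-duplicates are returned. With a lower `lambda_mult`, the second pick is the less similar document.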

Question about MATLAB

question = "what did they say about matlab?"

Similarity search

docs_ss = vectordb.similarity_search(question, k=3)

Results

docs_ss[0].page_content[:100]

‘those homeworks will be done in either MATLA B or in Octave, which is sort of — I some people’

docs_ss[1].page_content[:100]

‘those homeworks will be done in either MATLA B or in Octave, which is sort of — I some people’

MMR

docs_mmr = vectordb.max_marginal_relevance_search(question, k=3)

Note the difference in results with MMR.

docs_mmr[0].page_content[:100]

‘those homeworks will be done in either MATLA B or in Octave, which is sort of — I some people’

docs_mmr[1].page_content[:100]

“mathematical work, he feels like he’s disc overing truth and beauty in the universe. And says it”

Addressing Specificity: Metadata

Basics

Addressing Specificity: working with metadata
In Tutorial 3, we showed that a question about the third lecture can include results from other lectures as well.
To address this, many vectorstores support operations on metadata.
Metadata provides context for each embedded chunk.
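
As a rough illustration of what a metadata filter does, the sketch below keeps only chunks whose metadata matches every requested key/value pair. The chunk texts, the shortened file names, and the helper `filter_by_metadata` are hypothetical; a real vector store applies the filter internally as part of the search.

```python
def filter_by_metadata(docs, where):
    """Keep only documents whose metadata matches every key/value in `where`."""
    return [
        d for d in docs
        if all(d["metadata"].get(key) == value for key, value in where.items())
    ]


# Hypothetical chunks; real ones would come from the vector store.
chunks = [
    {"text": "regression intro", "metadata": {"source": "Lecture02.pdf", "page": 3}},
    {"text": "locally weighted regression", "metadata": {"source": "Lecture03.pdf", "page": 0}},
    {"text": "logistic regression", "metadata": {"source": "Lecture03.pdf", "page": 14}},
]

lecture3 = filter_by_metadata(chunks, {"source": "Lecture03.pdf"})
print([c["metadata"]["page"] for c in lecture3])  # [0, 14]
```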

Question about third lecture

question = "what did they say about regression in the third lecture?"

Similarity search

docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source": "../docs/cs229_lectures/MachineLearning-Lecture03.pdf"}
)

Result

for d in docs:
    print(d.metadata)

{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 4}

Addressing Specificity: Self-query retriever

Basics

Addressing Specificity: working with metadata using self-query retriever
This raises an interesting challenge: we often want to infer the metadata filter from the query itself.
To address this, we can use SelfQueryRetriever, which uses an LLM to extract:

The query string to use for vector search
A metadata filter to pass in as well

Most vector databases support metadata filters, so this doesn’t require any new databases or indexes.
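
The two outputs can be pictured with a small sketch. Note the assumption: the LLM step is replaced by a simple regular expression that only recognises phrases such as "in the third lecture", so this shows the shape of the result (a search string plus a filter), not how SelfQueryRetriever works internally. The helper `split_query` and the `LECTURES` mapping are hypothetical.

```python
import re

# Hypothetical mapping from ordinal words to the lecture files used in this tutorial.
LECTURES = {
    "first": "../docs/cs229_lectures/MachineLearning-Lecture01.pdf",
    "second": "../docs/cs229_lectures/MachineLearning-Lecture02.pdf",
    "third": "../docs/cs229_lectures/MachineLearning-Lecture03.pdf",
}


def split_query(question):
    """Split a question into (search_text, metadata_filter).

    Rule-based stand-in for the LLM step (assumption, for illustration only).
    """
    match = re.search(
        r"\bin the (first|second|third) lecture\b", question, re.IGNORECASE
    )
    if not match:
        return question.strip(), {}
    # Remove the lecture phrase from the text that goes to vector search.
    search_text = (question[:match.start()] + question[match.end():]).strip(" ?")
    search_text = re.sub(r"\s+", " ", search_text)
    return search_text, {"source": LECTURES[match.group(1).lower()]}


result = split_query("what did they say about regression in the third lecture?")
print(result)
```

The first element would be embedded and used for the similarity search; the second would be passed as the metadata filter.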

metadata_field_info

metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The lecture the chunk is from, should be one of `../docs/cs229_lectures/MachineLearning-Lecture01.pdf`, `../docs/cs229_lectures/MachineLearning-Lecture02.pdf`, or `../docs/cs229_lectures/MachineLearning-Lecture03.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the lecture",
        type="integer",
    ),
]

document_content_description

document_content_description = "Lecture notes"
llm = OpenAI(temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

Question about third lecture

question = "what did they say about regression in the third lecture?"

Retriever

docs = retriever.get_relevant_documents(question)

The first time you run this call, you will receive a warning that predict_and_parse is deprecated. It can be safely ignored.

Result

for doc in docs:
    print(doc.metadata)

{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 10}

Additional Tricks: Compression

Basics

Another approach for improving the quality of retrieved docs is compression.
Information most relevant to a query may be buried in a document with a lot of irrelevant text.
Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this.
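
The idea can be sketched without an LLM. In the example below, a keyword overlap test stands in for the LLM-based extractor (an assumption made so the snippet runs offline): each retrieved chunk is reduced to the sentences that share a term with the question. The helper `compress`, its stopword list, and the sample chunk are hypothetical.

```python
import re


def compress(document, question,
             stopwords=("what", "did", "they", "say", "about", "the")):
    """Keep only sentences that share a non-stopword term with the question."""
    terms = {
        w for w in re.findall(r"[a-z0-9]+", question.lower())
        if w not in stopwords
    }
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [
        s for s in sentences
        if terms & set(re.findall(r"[a-z0-9]+", s.lower()))
    ]
    return " ".join(kept)


# Hypothetical chunk, not taken from the lecture transcripts.
chunk = (
    "The course has four homeworks. "
    "Homeworks are done in MATLAB or Octave. "
    "Office hours are on Friday."
)
compressed = compress(chunk, "what did they say about matlab?")
print(compressed)  # Homeworks are done in MATLAB or Octave.
```

Only the compressed text would then be passed on to the final LLM call, which keeps the prompt shorter and more focused.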

Helper function: pretty print

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" +
          d.page_content for i, d in enumerate(docs)]))

Load LLM

# Create the LLM and a compressor that extracts only the query-relevant passages
llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

Question about MATLAB

question = "what did they say about matlab?"

Retriever

compressed_docs = compression_retriever.get_relevant_documents(question)

Result

pretty_print_docs(compressed_docs)

Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 3:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."
----------------------------------------------------------------------------------------------------
Document 4:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."

Combining Various Techniques

ContextualCompressionRetriever with MMR

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type="mmr")
)

Question

question = "what did they say about matlab?"

Retriever

compressed_docs = compression_retriever.get_relevant_documents(question)

Result

pretty_print_docs(compressed_docs)


Document 1:

"MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to plot data. And it's sort of an extremely easy to learn tool to use for implementing a lot of learning algorithms."
----------------------------------------------------------------------------------------------------
Document 2:

"And the student said, "Oh, it was the MATLAB." So for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard, and we'll actually have a short MATLAB tutorial in one of the discussion sections for those of you that don't know it."

Other Types of Retrieval

Basics

It's worth noting that a vector database is not the only kind of tool for retrieving documents.
The LangChain retriever abstraction includes other ways to retrieve documents, such as TF-IDF or SVM.

Load

# Load PDF
loader = PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf")

pages = loader.load()

all_page_text = [p.page_content for p in pages]

joined_page_text = " ".join(all_page_text)

Split

# Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500, chunk_overlap=150)

splits = text_splitter.split_text(joined_page_text)

Retrieve with SVM and TF-IDF

Support vector machine (SVM) retriever

# Retrieve
svm_retriever = SVMRetriever.from_texts(splits, embedding)

TF-IDF: term-frequency times inverse document-frequency retriever

tfidf_retriever = TFIDFRetriever.from_texts(splits)
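
To show what "term frequency times inverse document frequency" means in practice, here is a minimal scoring sketch in plain Python. It uses one common variant (raw term counts and an unsmoothed `log(N / df)`), which is an assumption and not necessarily what TFIDFRetriever computes. The helpers `tokenize` and `tfidf_rank` and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_rank(query, texts):
    """Return indices of `texts` sorted by descending TF-IDF score for `query`."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(texts)
    # Document frequency: number of texts containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    def score(tokens):
        tf = Counter(tokens)
        return sum(
            tf[term] * math.log(n_docs / df[term])
            for term in tokenize(query)
            if term in df
        )

    scores = [score(tokens) for tokens in tokenized]
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


# Hypothetical splits for illustration.
splits_demo = [
    "Homeworks are done in MATLAB or Octave.",
    "Clustering groups the pixels of an image.",
    "Octave is free and works for this class.",
]
ranking = tfidf_rank("matlab homeworks", splits_demo)
print(ranking[0])  # 0
```

Unlike the embedding-based retrievers, this ranking depends only on exact term overlap, so a chunk must contain the query words to score above zero.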

SVM retriever

question = "What are major topics for this class?"

docs_svm = svm_retriever.get_relevant_documents(question)
docs_svm[0]

Document(page_content=‘don't have a MATLAB license, for the purposes of this class, there's also — [inaudible] that down [inaudible] MATLAB — there' s also a software package called Octave you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about . actually I, well, so yeah, just a side comment for those of you that haven't seen before I guess, once a colleague of mine at a different university, not at , actually teaches another machine l earning course. He's taught it for many years. one day, he was in his office, and an old student of his from, lik e, ten years ago came his office and he said, “Oh, professo r, professor, thank you so much for your learning class. I learned so much from it. There's this stuff that I learned in your , and I now use every day. And it's help ed me make lots of money, and here's a of my big house.” my friend was very excited. He said, “W ow. That's great. I'm glad to hear this learning stuff was actually useful. So what was it that you learned? Was it regression? Was it the PCA? Was it the data ne tworks? What was it that you that was so helpful?” And the student said, “Oh, it was the MATLAB.” for those of you that don't know MATLAB yet, I hope you do learn it. It's not hard,’, metadata={})

TF-IDF retriever

question = "what did they say about matlab?"

docs_tfidf = tfidf_retriever.get_relevant_documents(question)

docs_tfidf[0]

Document(page_content=“Saxena and Min Sun here did, wh ich is given an image like this, right? This is actually a taken of the Stanford campus. You can apply that sort of cl ustering algorithm and the picture into regions. Let me actually blow that up so that you can see it more . Okay. So in the middle, you see the lines sort of groupi ng the image together, the image into [inaudible] regions. what Ashutosh and Min did was they then applied the learning algorithm to say can take this clustering and us e it to build a 3D model of the world? And so using the , they then had a lear ning algorithm try to learn what the 3D structure of the looks like so that they could come up with a 3D model that you can sort of fly , okay? Although many people used to th ink it’s not possible to take a single and build a 3D model, but using a lear ning algorithm and that sort of clustering is the first step. They were able to. ’ll just show you one more example. I like this because it’s a picture of Stanford with our Stanford campus. So again, taking th e same sort of clustering algorithms, taking same sort of unsupervised learning algor ithm, you can group the pixels into different . And using that as a pre-processing step, they eventually built this sort of 3D model of Stanford campus in a single picture. You can sort of walk into the ceiling, look”, metadata={})

Acknowledgments

This tutorial is mainly based on the excellent course “LangChain: Chat with Your Data” provided by Harrison Chase from LangChain and Andrew Ng from DeepLearning.AI.

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website.