Vectorstores and Embeddings

LangChain Tutorial 3

Jan Kirenz

Vectorstores and Embeddings

Dive into the concept of embeddings and explore vector store integrations within LangChain.

Setup

Python

import os
import numpy as np
import openai
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma


#import sys
#sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

Loading and Splitting

Load data

# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]

docs = []
for loader in loaders:
    docs.extend(loader.load())

Define splitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

Split data

splits = text_splitter.split_documents(docs)
len(splits)
  • 209

Embeddings

OpenAIEmbeddings

  • Let’s take our splits and embed them.
embedding = OpenAIEmbeddings()

Examples

sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

Compare similarity

np.dot(embedding1, embedding2)
  • 0.9631851837941705
np.dot(embedding1, embedding3)
  • 0.7710851013557284
np.dot(embedding2, embedding3)
  • 0.7596334120325541

Chroma Vectorstore

Setup

persist_directory = '../docs/chroma/'
!rm -rf ../docs/chroma  # remove old database files if any

Store data

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

Inspect data

print(vectordb._collection.count())
  • 209

Search vectorstore for email

question = "is there an email i can ask for help"
docs = vectordb.similarity_search(question,k=3)
len(docs)
  • 3

Inspect data

docs[0].page_content
  • “cs229-qa@cs.stanford.edu. This goes to an acc ount that’s read by all the TAs and me. So than sending us email individually, if you send email to this account, it will let us get back to you maximally quickly with answers to your questions. you’re asking questions about homework probl ems, please say in the subject line which and which question the email refers to, since that will also help us to route question to the appropriate TA or to me appropriately and get the response back to quickly. ‘s see. Skipping ahead — let’s see — for homework, one midterm, one open and term . Notice on the honor code. So one thi ng that I think will help you to succeed and well in this class and even help you to enjoy this cla ss more is if you form a study . start looking around where you’ re sitting now or at the end of class today, mingle a bit and get to know your classmates. I strongly encourage you to form study groups sort of have a group of people to study with and have a group of your fellow students talk over these concepts with. You can also post on the class news group if you want to that to try to form a study group. some of the problems sets in this cla ss are reasonably difficult. People that have the class before may tell you they were very difficult. And just I bet it would be fun for you, and you’d probably have a be tter learning experience if you form a”

Persist data

  • Let’s save this so we can use it later!
vectordb.persist()

Failure modes

Basics

  • This seems great, and basic similarity search will get you 80% of the way there very easily.

  • But there are some failure modes that can creep up.

  • Here are some edge cases that can arise

Search vectorstore for matlab

question = "what did they say about matlab?"
docs = vectordb.similarity_search(question,k=5)

Inspect data

docs[0]
  • Document(page_content=‘those homeworks will be done in either MATLA B or in Octave, which is sort of — I some people call it a free ve rsion of MATLAB, which it sort of is, sort of isn't. I guess for those of you that haven't s een MATLAB before, and I know most of you , MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to data. And it's sort of an extremely easy to learn tool to use for implementing a lot of algorithms. in case some of you want to work on your own home computer or something if you 't have a MATLAB license, for the purposes of this class, there's also — [inaudible] that down [inaudible] MATLAB — there' s also a software package called Octave you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about . actually I, well, so yeah, just a side comment for those of you that haven't seen before I guess, once a colleague of mine at a different university, not at , actually teaches another machine l earning course. He's taught it for many years. one day, he was in his office, and an old student of his from, lik e, ten years ago came his office and he said, “Oh, professo r, professor, thank you so much for your’, metadata={‘source’: ‘../docs/cs229_lectures/MachineLearning-Lecture01.pdf’, ‘page’: 8})
docs[1]
  • Document(page_content=‘those homeworks will be done in either MATLA B or in Octave, which is sort of — I some people call it a free ve rsion of MATLAB, which it sort of is, sort of isn't. I guess for those of you that haven't s een MATLAB before, and I know most of you , MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to data. And it's sort of an extremely easy to learn tool to use for implementing a lot of algorithms. in case some of you want to work on your own home computer or something if you 't have a MATLAB license, for the purposes of this class, there's also — [inaudible] that down [inaudible] MATLAB — there' s also a software package called Octave you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of this class, it will work for just about . actually I, well, so yeah, just a side comment for those of you that haven't seen before I guess, once a colleague of mine at a different university, not at , actually teaches another machine l earning course. He's taught it for many years. one day, he was in his office, and an old student of his from, lik e, ten years ago came his office and he said, “Oh, professo r, professor, thank you so much for your’, metadata={‘source’: ‘../docs/cs229_lectures/MachineLearning-Lecture01.pdf’, ‘page’: 8})

Insights

  • Notice that we’re getting duplicate chunks (because of the duplicate MachineLearning-Lecture01.pdf in the index).

  • Semantic search fetches all similar documents, but does not enforce diversity.

  • docs[0] and docs[1] are indentical.

Search vectorstore for third lecture

question = "what did they say about regression in the third lecture?"
docs = vectordb.similarity_search(question,k=5)

Inspect data

for doc in docs:
    print(doc.metadata)
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 0}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 14}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture02.pdf', 'page': 0}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture03.pdf', 'page': 6}
{'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 8}
  • The question was about the third lecture, but includes results from other lectures as well.

  • We discuss approaches to handle these problems in the next tutorial

Acknowledgments

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website