Document Splitting

LangChain Tutorial 2

Jan Kirenz

Discover the best practices and considerations for splitting data.

Setup

Python

from langchain.document_loaders import NotionDirectoryLoader, PyPDFLoader
from langchain.text_splitter import (
    CharacterTextSplitter,
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
    TokenTextSplitter,
)
from dotenv import load_dotenv, find_dotenv
import os
import openai
# import sys
# sys.path.append('../..')

_ = load_dotenv(find_dotenv())  # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

Splitting

Character Text Splitter

chunk_size = 26
chunk_overlap = 4

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Text 1

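Why doesn’t this split the string below? The string is exactly 26 characters long, which does not exceed chunk_size, so it fits in a single chunk.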

text1 = 'abcdefghijklmnopqrstuvwxyz'

r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

Text 2

text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

r_splitter.split_text(text2)

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']
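The results for Text 1 and Text 2 are consistent with a plain sliding window: each chunk holds up to chunk_size characters, and the next chunk starts chunk_overlap characters before the previous one ended. The sketch below reproduces both outputs in pure Python. The helper name `sliding_chunks` is hypothetical and this is not LangChain's implementation, only an illustration of how the two parameters interact when the text has no separators.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghijklmnopqrstuvwxyz", 26, 4))
print(sliding_chunks("abcdefghijklmnopqrstuvwxyzabcdefg", 26, 4))
```

With 26 characters the first window already covers the whole string, so nothing is split. With 33 characters the second window starts at position 22, which is why it begins with 'wxyz'.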

Text 3

text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

r_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

c_splitter.split_text(text3)

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

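CharacterTextSplitter

CharacterTextSplitter splits on a single separator, which defaults to "\n\n". text3 contains no blank line, so it came back as one chunk. Setting the separator to a space fixes this: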

c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)

c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
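The output above can be understood as a two-step process: split the text on the separator, then greedily merge the pieces back together until the next piece would exceed chunk_size, keeping a tail of at most chunk_overlap characters for the next chunk. Here is a minimal sketch of that idea. The function name `merge_pieces` is an assumption made for illustration, and the real splitter handles more edge cases.

```python
def merge_pieces(text: str, chunk_size: int, chunk_overlap: int, separator: str = " ") -> list[str]:
    """Split on separator, then greedily re-join pieces into overlapping chunks."""
    chunks = []
    current = []  # pieces collected for the chunk being built
    for piece in text.split(separator):
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The next piece does not fit: emit the chunk built so far.
            chunks.append(separator.join(current))
            # Keep only a tail that fits the overlap budget and leaves room for the piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_pieces(text3, chunk_size=26, chunk_overlap=4))
```

Each chunk ends after 13 letters because 13 letters plus 12 spaces is 25 characters, and one more letter with its space would reach 27. The tail 'l m' (3 characters) is the longest one that fits in the overlap of 4.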

Recursive Splitting Details

RecursiveCharacterTextSplitter

RecursiveCharacterTextSplitter is recommended for generic text.

some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

Define splitter

c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separators=["\n\n", "\n", " ", ""]
)

Character Splitter output

c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,', 'have a space.and words are separated by space.']

Recursive Splitter output

r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.", 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
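The recursive splitter cut at the paragraph break, while the character splitter cut wherever the size limit happened to fall. The underlying idea is to try the separators in order and only fall back to a finer one for pieces that are still too long. The sketch below shows that fallback logic in pure Python. It is a simplified illustration under stated assumptions: `recursive_split` is a hypothetical name, it drops the separators, and it does not merge small pieces back together the way the library does.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator present; recurse on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep == "":
            break  # the empty separator means: cut at the character level
        if sep in text:
            finer = separators[i + 1:]  # only finer separators for the sub-pieces
            result = []
            for piece in text.split(sep):
                result.extend(recursive_split(piece, chunk_size, finer))
            return result
    # No usable separator left: hard cut every chunk_size characters.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=7, separators=["\n\n", "\n", " ", ""]))
```

In the sample, the first paragraph fits into one chunk and is kept whole, while the second paragraph is too long and is split further on spaces.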

Adapt splitter 1

Let’s reduce the chunk size a bit and add a period to our separators:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]
)

r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related", '. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.', 'Paragraphs are often delimited with a carriage return or two carriage returns', '. Carriage returns are the "backslash n" you see embedded in this string', '. Sentences have a period at the end, but also, have a space.and words are separated by space.']

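Adapt splitter 2

With a plain period separator, the periods end up at the start of the following chunk. A lookbehind pattern splits after the period and space instead, so each period stays with its sentence:

r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]
)

r_splitter.split_text(some_text)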

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related.", 'For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.', 'Paragraphs are often delimited with a carriage return or two carriage returns.', 'Carriage returns are the "backslash n" you see embedded in this string.', 'Sentences have a period at the end, but also, have a space.and words are separated by space.']
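The difference between the two separator patterns can be seen with Python's own `re` module, independent of LangChain. This small sketch uses a made-up sample sentence. A plain `\. ` pattern consumes the period and the space, whereas the lookbehind `(?<=\. )` matches the empty position after them, so nothing is removed from the text.

```python
import re

text = "First idea. Second idea. Third idea."

# Plain pattern: the matched ". " is consumed and disappears from the pieces.
plain = re.split(r"\. ", text)

# Lookbehind: split at the zero-width position right after ". " (Python 3.7+).
lookbehind = re.split(r"(?<=\. )", text)

print(plain)
print(lookbehind)
print([piece.strip() for piece in lookbehind])
```

The last line strips the trailing space from each piece, which gives sentences that end with their own period.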

Split a Document

Load PDF

loader = PyPDFLoader("../docs/cs229_lectures/MachineLearning-Lecture01.pdf")
pages = loader.load()

Define splitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

Split document

docs = text_splitter.split_documents(pages)

Inspect data

len(docs)

len(pages)

Inspect data

print(docs[0].page_content[300:800])

my name’s Andrew Ng and I’ll be instru ctor for this class. And so I personally work in machine learning, and I’ ve worked on it for about 15 years now, and I actually think that machine learning is th e most exciting field of all the computer sciences. So I’m actually always excited about teaching this class. Sometimes I actually think that machine learning is not only the most exciting thin g in computer science, but the most exciting thing in all of human e ndeavor, so maybe a little b

Split Notion data

Load data

loader = NotionDirectoryLoader("../docs/Notion_DB")
notion_db = loader.load()

Define splitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

Split document

docs = text_splitter.split_documents(notion_db)

Inspect data

len(notion_db)

len(docs)

Inspect data

print(docs[0].page_content)

# Getting Started
👋 Welcome to Notion!
Here are the basics:
- [ ]  Click anywhere and just start typing
- [ ]  Hit `/` to see all the types of content you can add - headers, videos, sub pages, etc.
    
    [Example sub page](https://www.notion.so/Example-sub-page-92f63253929d456bbf12cd696e21e045?pvs=21)
    
- [ ]  Highlight any text, and use the menu that pops up to **style** *your* ~~writing~~ `however` [you](https://www.notion.so/product) like
- [ ]  See the `⋮⋮` to the left of this checkbox on hover? Click and drag to move this line
- [ ]  Click the `+ New Page` button at the bottom of your sidebar to add a new page
- [ ]  Click `Templates` in your sidebar to get started with pre-built pages
- This is a toggle block. Click the little triangle to see more useful tips!
    - [Template Gallery](https://www.notion.so/181e961aeb5c4ee6915307c0dfd5156d?pvs=21): More templates built by the Notion community

Token splitting

Basics

We can also split on token count explicitly, if we want.
This can be useful because LLMs often have context windows designated in tokens.
Tokens are often ~4 characters.

TokenTextSplitter 1

text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

text1 = "foo bar bazzyfoo"

text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
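The six tokens above come from a string of only 16 characters, which shows that the "~4 characters per token" rule is an average rather than a guarantee. The following sketch compares the rule of thumb with the token list shown above. The helper `estimate_tokens` is a hypothetical name and only a rough heuristic. It is not a tokenizer.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate based on the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


text1 = "foo bar bazzyfoo"
# Token list copied from the TokenTextSplitter output above.
actual_tokens = ["foo", " bar", " b", "az", "zy", "foo"]

print(len(text1))              # number of characters
print(estimate_tokens(text1))  # heuristic estimate
print(len(actual_tokens))      # tokens produced by the splitter
```

Unusual words such as "bazzyfoo" break into several short tokens, so the heuristic underestimates here. For a hard limit, counting with the real tokenizer is the safer choice.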

TokenTextSplitter 2

text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

docs = text_splitter.split_documents(pages)

docs[0]

Document(page_content='MachineLearning-Lecture01 ', metadata={'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0})

pages[0].metadata

{'source': '../docs/cs229_lectures/MachineLearning-Lecture01.pdf', 'page': 0}
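The chunk and the page it came from carry the same metadata, which is what lets us trace a chunk back to its source later. The sketch below illustrates that behaviour with a stand-in data class. The names `Doc` and `split_with_metadata` are assumptions made for this example, the file name is made up, and fixed-width slicing stands in for a real splitter.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document: text plus a metadata dictionary."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_with_metadata(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Cut every document into fixed-size chunks and copy its metadata to each chunk."""
    chunks = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            # Copy the dict so later edits to one chunk do not affect the others.
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


pages = [Doc("abcdefgh", {"source": "example.pdf", "page": 0})]
chunks = split_with_metadata(pages, chunk_size=3)
for chunk in chunks:
    print(chunk.page_content, chunk.metadata)
```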

Context Aware Splitting with Markdown

Basics

Chunking aims to keep text with common context together.
Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.
We can use MarkdownHeaderTextSplitter to preserve header metadata in our chunks.

Markdown example

markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

Headers to split on

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

MarkdownHeaderTextSplitter

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

Split text

md_header_splits = markdown_splitter.split_text(markdown_document)

md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
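The two results show that each chunk remembers the headers it sits under. One way to picture this is a loop that walks through the lines, keeps track of the currently active headers, and starts a new chunk whenever a header appears. The following is a minimal sketch of that idea in pure Python. The function `split_by_headers` is a hypothetical helper, it returns plain dictionaries instead of Document objects, and it is not the library's implementation.

```python
def split_by_headers(markdown: str, headers_to_split_on: list[tuple[str, str]]) -> list[dict]:
    """Group text under its headers and attach the active headers as metadata."""
    names = dict(headers_to_split_on)  # e.g. "##" -> "Header 2"
    chunks = []
    metadata = {}  # headers that are active for the current position
    buffer = []    # text lines collected since the last header

    def flush():
        content = "\n".join(buffer).strip()
        if content:
            chunks.append({"page_content": content, "metadata": dict(metadata)})
        buffer.clear()

    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        marker, _, title = line.partition(" ")
        if marker in names and title.strip():
            flush()  # a header closes the chunk before it
            # A new header resets headers of the same or a deeper level.
            for other_marker, key in names.items():
                if len(other_marker) >= len(marker):
                    metadata.pop(key, None)
            metadata[names[marker]] = title.strip()
        elif line:
            buffer.append(line)
    flush()
    return chunks


document = (
    "# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
    "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly"
)
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
for chunk in split_by_headers(document, headers):
    print(chunk)
```

Note that "Chapter 2" clears the "Section" header, so the last chunk only carries "Header 1" and "Header 2".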

Splitting Notion Markdown

Load data

loader = NotionDirectoryLoader("../docs/Notion_DB")
docs = loader.load()

Join data

txt = ' '.join([d.page_content for d in docs])
txt

‘# Getting Started👋 Welcome to Notion!are the basics:- [ ] Click anywhere and just start typing- [ ] Hit / to see all the types of content you can add - headers, videos, sub pages, etc.(https://www.notion.so/Example-sub-page-92f63253929d456bbf12cd696e21e045?pvs=21)- [ ] Highlight any text, and use the menu that pops up to style your ~~writing~~ however you like- [ ] See the ⋮⋮ to the left of this checkbox on hover? Click and drag to move this line- [ ] Click the + New Page button at the bottom of your sidebar to add a new page- [ ] Click Templates in your sidebar to get started with pre-built pages- This is a toggle block. Click the little triangle to see more useful tips!- Template Gallery: More templates built by the Notion community- Help & Support: ****Guides and FAQs for everything in Notion- Stay organized with your sidebar and nested pages:it in action:(https://youtu.be/TL_N2pmh9O0) minute(https://youtu.be/FXIrojSK3Jo) minutes(https://youtu.be/2Pwzff-uffU) minutes(https://youtu.be/O8qdvSxDYNY) minutesour YouTube channel to watch 50+ more tutorials👉Have a question? Click the ? at the bottom right for more guides, or to send us a message.’

Define Splitter

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

Split text

md_header_splits = markdown_splitter.split_text(txt)

Output

md_header_splits[0]

Document(page_content=‘👋 Welcome to Notion! are the basics: - [ ] Click anywhere and just start typing- [ ] Hit / to see all the types of content you can add - headers, videos, sub pages, etc. (https://www.notion.so/Example-sub-page-92f63253929d456bbf12cd696e21e045?pvs=21) - [ ] Highlight any text, and use the menu that pops up to style your ~~writing~~ however you like- [ ] See the ⋮⋮ to the left of this checkbox on hover? Click and drag to move this line- [ ] Click the + New Page button at the bottom of your sidebar to add a new page- [ ] Click Templates in your sidebar to get started with pre-built pages- This is a toggle block. Click the little triangle to see more useful tips!- Template Gallery: More templates built by the Notion community- Help & Support: ****Guides and FAQs for everything in Notion- Stay organized with your sidebar and nested pages: it in action: (https://youtu.be/TL_N2pmh9O0) minute (https://youtu.be/FXIrojSK3Jo) minutes (https://youtu.be/2Pwzff-uffU) minutes (https://youtu.be/O8qdvSxDYNY) minutes our YouTube channel to watch 50+ more tutorials 👉Have a question? Click the ? at the bottom right for more guides, or to send us a message.’, metadata={‘Header 1’: ‘Getting Started’})

Acknowledgments

This tutorial is mainly based on the excellent course “LangChain: Chat with Your Data” provided by Harrison Chase from LangChain and Andrew Ng from DeepLearning.AI.

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website.