[‘a b c d e f g h i j k l m’, ‘l m n o p q r s t u v w x’, ‘w x y z’]
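The repeated `l m` and `w x` at the chunk boundaries above are the overlap between neighbouring chunks. Here is a minimal plain-Python sketch of that idea. The function name `split_with_overlap` and the values `chunk_size=26` and `chunk_overlap=4` are assumptions chosen for illustration; this is not LangChain's implementation.

```python
def split_with_overlap(text, chunk_size, chunk_overlap, separator=" "):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, carrying a short tail into the next chunk."""
    chunks, current = [], []
    for word in text.split(separator):
        candidate = separator.join(current + [word])
        if current and len(candidate) > chunk_size:
            chunks.append(separator.join(current))
            # Keep only the tail of the finished chunk that fits the overlap budget.
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(word)
    if current:
        chunks.append(separator.join(current))
    return chunks


alphabet = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
chunks = split_with_overlap(alphabet, chunk_size=26, chunk_overlap=4)
print(chunks)
# prints ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

The overlap is measured in characters here, so `l m` (three characters) fits the budget of 4 while `k l m` (five characters) does not.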
Recursive Splitting Details
RecursiveCharacterTextSplitter
RecursiveCharacterTextSplitter is recommended for generic text.
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n\
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""
[‘When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the “backslash n” you see embedded in this string. Sentences have a period at the end, but also,’, ‘have a space.and words are separated by space.’]
Recursive Splitter output
r_splitter.split_text(some_text)
[“When writing documents, writers will use document structure to group content. This can convey to the reader, which idea’s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.”, ‘Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the “backslash n” you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.’]
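Notice that the recursive splitter breaks the text at the paragraph boundary rather than in the middle of a sentence. The idea can be sketched in a few lines of plain Python: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. The function `recursive_split` below is a hypothetical, simplified illustration (no chunk overlap, separators are dropped), not LangChain's actual code.

```python
def recursive_split(text, chunk_size, separators):
    """Return pieces of at most chunk_size characters, preferring to
    break on the earliest (coarsest) separator that applies."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Nothing left to split on: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
result = recursive_split(sample, chunk_size=8, separators=["\n\n", " "])
print(result)
# prints ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits in one chunk, so it is never split on spaces. Only the second, longer paragraph is handed down to the finer separator.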
Adapt splitter 1
Let’s reduce the chunk size a bit and add a period to our separators:
[“When writing documents, writers will use document structure to group content. This can convey to the reader, which idea’s are related”, ‘. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.’, ‘Paragraphs are often delimited with a carriage return or two carriage returns’, ‘. Carriage returns are the “backslash n” you see embedded in this string’, ‘. Sentences have a period at the end, but also, have a space.and words are separated by space.’]
[“When writing documents, writers will use document structure to group content. This can convey to the reader, which idea’s are related.”, ‘For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.’, ‘Paragraphs are often delimited with a carriage return or two carriage returns.’, ‘Carriage returns are the “backslash n” you see embedded in this string.’, ‘Sentences have a period at the end, but also, have a space.and words are separated by space.’]
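In the first output above the periods are detached from the sentences they end; in the second they stay in place. One way to get the second behaviour is a regex lookbehind such as `(?<=\. )`, which matches the position right after a period and a space without consuming them. That this is how the second output was produced is an assumption; the sketch below only demonstrates the regex behaviour with Python's standard `re` module on a made-up sentence.

```python
import re

text = "Ideas are related. Sentences end with a period. Words follow."

# Splitting on the literal ". " consumes the separator,
# so the period no longer belongs to its sentence.
plain = re.split(r"\. ", text)
print(plain)
# prints ['Ideas are related', 'Sentences end with a period', 'Words follow.']

# The lookbehind matches the empty position after ". ",
# so every sentence keeps its own period.
kept = [part.strip() for part in re.split(r"(?<=\. )", text)]
print(kept)
# prints ['Ideas are related.', 'Sentences end with a period.', 'Words follow.']
```

Splitting on an empty match like this requires Python 3.7 or later.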
my name’s Andrew Ng and I’ll be instru ctor for this class. And so I personally work in machine learning, and I’ ve worked on it for about 15 years now, and I actually think that machine learning is th e most exciting field of all the computer sciences. So I’m actually always excited about teaching this class. Sometimes I actually think that machine learning is not only the most exciting thin g in computer science, but the most exciting thing in all of human e ndeavor, so maybe a little b
# Getting Started👋 Welcome to Notion!Here are the basics:- [ ] Click anywhere and just start typing- [ ] Hit `/` to see all the types of content you can add - headers, videos, sub pages, etc.[Example sub page](https://www.notion.so/Example-sub-page-92f63253929d456bbf12cd696e21e045?pvs=21)- [ ] Highlight any text, and use the menu that pops up to **style** *your* ~~writing~~ `however`[you](https://www.notion.so/product) like- [ ] See the `⋮⋮` to the left of this checkbox on hover? Click and drag to move this line- [ ] Click the `+ New Page` button at the bottom of your sidebar to add a new page- [ ] Click `Templates` in your sidebar to get started with pre-built pages- This is a toggle block. Click the little triangle to see more useful tips! - [Template Gallery](https://www.notion.so/181e961aeb5c4ee6915307c0dfd5156d?pvs=21): More templates built by the Notion community
Token splitting
Basics
We can also split on token count explicitly, if we want.
This can be useful because LLMs often have context windows measured in tokens.
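To make the idea concrete, here is a minimal sketch of token-count chunking with overlap. It uses whitespace-separated words as a stand-in for real tokens, which is an assumption made to keep the example dependency-free: an actual LLM tokenizer produces subword tokens, so the counts will differ. The function name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(tokens, chunk_size, chunk_overlap=0):
    """Slide a window of chunk_size tokens over the list, moving
    forward by chunk_size - chunk_overlap tokens each step."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end
    return chunks


# Whitespace "tokens" stand in for a real tokenizer's output.
tokens = "the quick brown fox jumps over the lazy dog again".split()
token_chunks = split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in token_chunks:
    print(" ".join(chunk))
# prints:
# the quick brown fox
# fox jumps over the
# the lazy dog again
```

Because the budget is counted in tokens rather than characters, each chunk maps directly onto the unit the model's context window is measured in.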
Chunking aims to keep text with common context together.
Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly when splitting.
We can use MarkdownHeaderTextSplitter to preserve header metadata in our chunks.
Markdown example
markdown_document = """# Title\n\n\
## Chapter 1\n\n\
Hi this is Jim\n\n Hi this is Joe\n\n\
### Section \n\n\
Hi this is Lance \n\n## Chapter 2\n\n\
Hi this is Molly"""
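To see what preserving header metadata means, here is a simplified plain-Python sketch that splits a Markdown string on its headers and attaches the active headers to each chunk. The function `split_markdown_by_headers`, its parameter, and the dictionary output format are hypothetical stand-ins for illustration; the real MarkdownHeaderTextSplitter may behave differently in its details.

```python
def split_markdown_by_headers(markdown, headers_to_split_on):
    """Group body lines into chunks and record which headers each chunk
    sits under. headers_to_split_on is a list of (prefix, name) pairs."""
    # Check longer prefixes first so "###" is not mistaken for "#".
    ordered = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks, active, content = [], [], []  # active: stack of (level, name, title)

    def flush():
        if content:
            chunks.append({
                "page_content": "\n".join(content),
                "metadata": {name: title for _, name, title in active},
            })
            content.clear()

    for raw in markdown.split("\n"):
        line = raw.strip()
        match = next(((p, n) for p, n in ordered if line.startswith(p + " ")), None)
        if match:
            flush()
            prefix, name = match
            level = len(prefix)
            # A new header closes every header at the same or a deeper level.
            while active and active[-1][0] >= level:
                active.pop()
            active.append((level, name, line[len(prefix):].strip()))
        elif line:
            content.append(line)
    flush()
    return chunks


doc = ("# Title\n\n## Chapter 1\n\nHi this is Jim\n\nHi this is Joe\n\n"
       "### Section\n\nHi this is Lance\n\n## Chapter 2\n\nHi this is Molly")
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
md_chunks = split_markdown_by_headers(doc, headers)
print(md_chunks[1])
# prints {'page_content': 'Hi this is Lance', 'metadata': {'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}}
```

Each chunk carries the full path of headers above it, so the text "Hi this is Lance" can later be traced back to Title, Chapter 1, Section.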
‘# Getting Started👋 Welcome to Notion!are the basics:- [ ] Click anywhere and just start typing- [ ] Hit / to see all the types of content you can add - headers, videos, sub pages, etc.(https://www.notion.so/Example-sub-page-92f63253929d456bbf12cd696e21e045?pvs=21)- [ ] Highlight any text, and use the menu that pops up to styleyourwritinghoweveryou like- [ ] See the ⋮⋮ to the left of this checkbox on hover? Click and drag to move this line- [ ] Click the + New Page button at the bottom of your sidebar to add a new page- [ ] Click Templates in your sidebar to get started with pre-built pages- This is a toggle block. Click the little triangle to see more useful tips!- Template Gallery: More templates built by the Notion community- Help & Support: ****Guides and FAQs for everything in Notion- Stay organized with your sidebar and nested pages:it in action:(https://youtu.be/TL_N2pmh9O0) minute(https://youtu.be/FXIrojSK3Jo) minutes(https://youtu.be/2Pwzff-uffU) minutes(https://youtu.be/O8qdvSxDYNY) minutesour YouTube channel to watch 50+ more tutorials👉Have a question? Click the ? at the bottom right for more guides, or to send us a message.’
Document(page_content=‘👋 Welcome to Notion! are the basics: - [ ] Click anywhere and just start typing- [ ] Hit / to see all the types of content you can add - headers, videos, sub pages, etc. (https://www.notion.so/Example-sub-page-92f63253929d456bbf12cd696e21e045?pvs=21) - [ ] Highlight any text, and use the menu that pops up to styleyourwritinghoweveryou like- [ ] See the ⋮⋮ to the left of this checkbox on hover? Click and drag to move this line- [ ] Click the + New Page button at the bottom of your sidebar to add a new page- [ ] Click Templates in your sidebar to get started with pre-built pages- This is a toggle block. Click the little triangle to see more useful tips!- Template Gallery: More templates built by the Notion community- Help & Support: ****Guides and FAQs for everything in Notion- Stay organized with your sidebar and nested pages: it in action: (https://youtu.be/TL_N2pmh9O0) minute (https://youtu.be/FXIrojSK3Jo) minutes (https://youtu.be/2Pwzff-uffU) minutes (https://youtu.be/O8qdvSxDYNY) minutes our YouTube channel to watch 50+ more tutorials 👉Have a question? Click the ? at the bottom right for more guides, or to send us a message.’, metadata={‘Header 1’: ‘Getting Started’})
Acknowledgments
This tutorial is mainly based on the excellent course “LangChain: Chat with Your Data” taught by Harrison Chase from LangChain and Andrew Ng from DeepLearning.AI.
What’s next?
Congratulations! You have completed this tutorial 👍