Tagging and Extraction Using OpenAI functions

Tutorial 4

Jan Kirenz

Tagging and Extraction Using OpenAI functions

Setup

import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) 
openai.api_key = os.environ['OPENAI_API_KEY']
from pydantic import BaseModel, Field
from typing import List
from typing import Optional

from langchain.utils.openai_functions import convert_pydantic_to_openai_function
from langchain.prompts import ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.output_parsers.openai_functions import JsonOutputFunctionsParser
from langchain.output_parsers.openai_functions import JsonKeyOutputFunctionsParser
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.runnable import RunnableLambda

Tagging

Create tagging class

class Tagging(BaseModel):
    """Tag the piece of text with particular info."""
    sentiment: str = Field(description="sentiment of text, should be `pos`, `neg`, or `neutral`")
    language: str = Field(description="language of text (should be ISO 639-1 code)")

Take a look at the class

convert_pydantic_to_openai_function(Tagging)

{‘name’: ‘Tagging’, ‘description’: ‘Tag the piece of text with particular info.’, ‘parameters’: {‘title’: ‘Tagging’, ‘description’: ‘Tag the piece of text with particular info.’, ‘type’: ‘object’, ‘properties’: {‘sentiment’: {‘title’: ‘Sentiment’, ‘description’: ‘sentiment of text, should be pos, neg, or neutral’, ‘type’: ‘string’}, ‘language’: {‘title’: ‘Language’, ‘description’: ‘language of text (should be ISO 639-1 code)’, ‘type’: ‘string’}}, ‘required’: [‘sentiment’, ‘language’]}}

Create model, tagging function and prompt

model = ChatOpenAI(temperature=0)
tagging_functions = [convert_pydantic_to_openai_function(Tagging)]
prompt = ChatPromptTemplate.from_messages([
    ("system", "Think carefully, and then tag the text as instructed"),
    ("user", "{input}")
])

Bind model to tagging function and create chain

We force the model to use the tagging functions

model_with_functions = model.bind(
    functions=tagging_functions,
    function_call={"name": "Tagging"}
)
tagging_chain = prompt | model_with_functions

Call the function with example 1

tagging_chain.invoke({"input": "I like the book Sapiens"})
  • AIMessage(content=’‘, additional_kwargs={’function_call’: {‘name’: ‘Tagging’, ‘arguments’: ‘{“sentiment”: “pos”,“language”: “en”}’}})

Call the function with example 2

tagging_chain.invoke({"input": "Das 'Buch Eine Anleitung zum guten Leben: Wie Sie die alte Kunst des Stoizismus' ist sehr lesenswert"})
  • AIMessage(content=’‘, additional_kwargs={’function_call’: {‘name’: ‘Tagging’, ‘arguments’: ‘{“sentiment”: “pos”,“language”: “de”}’}})

Use output parser

  • Obtain a cleaner result with JsonOutputFunctionsParser()
tagging_chain = prompt | model_with_functions | JsonOutputFunctionsParser()
tagging_chain.invoke({"input": "Das 'Buch Eine Anleitung zum guten Leben: Wie Sie die alte Kunst des Stoizismus' ist sehr lesenswert"})
  • {‘sentiment’: ‘pos’, ‘language’: ‘de’}

Extraction

Extraction is similar to tagging, but used for extracting multiple pieces of information.

Define class

class Person(BaseModel):
    """Information about a person."""
    name: str = Field(description="person's name")
    age: Optional[int] = Field(description="person's age")
class Information(BaseModel):
    """Information to extract."""
    people: List[Person] = Field(description="List of info about people")

Convert Pydantic to OpenAI function

convert_pydantic_to_openai_function(Information)

{‘name’: ‘Information’, ‘description’: ‘Information to extract.’, ‘parameters’: {‘title’: ‘Information’, ‘description’: ‘Information to extract.’, ‘type’: ‘object’, ‘properties’: {‘people’: {‘title’: ‘People’, ‘description’: ‘List of info about people’, ‘type’: ‘array’, ‘items’: {‘title’: ‘Person’, ‘description’: ‘Information about a person.’, ‘type’: ‘object’, ‘properties’: {‘name’: {‘title’: ‘Name’, ‘description’: “person’s name”, ‘type’: ‘string’}, ‘age’: {‘title’: ‘Age’, ‘description’: “person’s age”, ‘type’: ‘integer’}}, ‘required’: [‘name’]}}}, ‘required’: [‘people’]}}

Set up model

extraction_functions = [convert_pydantic_to_openai_function(Information)]

extraction_model = model.bind(functions=extraction_functions, function_call={"name": "Information"})

Test model

extraction_model.invoke("Joe is 30, his mom is Martha")
  • AIMessage(content=’‘, additional_kwargs={’function_call’: {‘name’: ‘Information’, ‘arguments’: ‘{“people”: [,]}’}})

  • Model inputs age 0 if age isn’t provided

Update prompt

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the relevant information, if not explicitly provided do not guess. Extract partial info"),
    ("human", "{input}")
])
extraction_chain = prompt | extraction_model
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
  • AIMessage(content=’‘, additional_kwargs={’function_call’: {‘name’: ‘Information’, ‘arguments’: ‘{“people”: [,]}’}})

Parse output

extraction_chain = prompt | extraction_model | JsonOutputFunctionsParser()
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
  • {‘people’: [{‘name’: ‘Joe’, ‘age’: 30}, {‘name’: ‘Martha’}]}

Use different output parser

  • Use JsonKeyOutputFunctionsParser()to only extract relevant info
extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="people")
extraction_chain.invoke({"input": "Joe is 30, his mom is Martha"})
  • [{‘name’: ‘Joe’, ‘age’: 30}, {‘name’: ‘Martha’}]

Blog post example

We can apply tagging and axtracting to a larger body of text.

Load document

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")

documents = loader.load()
doc = documents[0]

Inspect content

page_content = doc.page_content[:10000]
print(page_content[:1000])
LLM Powered Autonomous Agents | Lil'Log

Lil'Log

Posts

Archive

Search

Tags

FAQ

emojisearch.app

      LLM Powered Autonomous Agents
    
June 23, 2023 · 31 min · Lilian Weng

Table of Contents
Agent System Overview
Component One: Planning
Task Decomposition
Self-Reflection
Component Two: Memory
Types of Memory
Maximum Inner Product Search (MIPS)
Component Three: Tool Use
Case Studies
Scientific Discovery Agent
Generative Agents Simulation
Proof-of-Concept Examples
Challenges
Citation
References

Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In

Blog post tagging

Create class to create article overview and tags

class Overview(BaseModel):
    """Overview of a section of text."""
    summary: str = Field(description="Provide a concise summary of the content.")
    language: str = Field(description="Provide the language that the content is written in.")
    keywords: str = Field(description="Provide keywords related to the content.")

Setup the chain

overview_tagging_function = [
    convert_pydantic_to_openai_function(Overview)
]

tagging_model = model.bind(
    functions=overview_tagging_function,
    function_call={"name":"Overview"}
)

tagging_chain = prompt | tagging_model | JsonOutputFunctionsParser()

Invoke chain

tagging_chain.invoke({"input": page_content})
  • {‘summary’: ‘This article discusses the concept of building autonomous agents powered by LLM (large language model) as their core controller. It explores the key components of such agents, including planning, memory, and tool use. It also covers various techniques for task decomposition and self-reflection in autonomous agents.’, ‘language’: ‘English’, ‘keywords’: ‘LLM, autonomous agents, planning, memory, tool use, task decomposition, self-reflection’}

Blog post extraction

Define class to extract papers

class Paper(BaseModel):
    """Information about papers mentioned."""
    title: str
    author: Optional[str]


class Info(BaseModel):
    """Information to extract"""
    papers: List[Paper]

Setup extraction chain

paper_extraction_function = [
    convert_pydantic_to_openai_function(Info)
]

extraction_model = model.bind(
    functions=paper_extraction_function, 
    function_call={"name":"Info"}
)

extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

Invoke chain

extraction_chain.invoke({"input": page_content})
  • [{‘title’: ‘LLM Powered Autonomous Agents’, ‘author’: ‘Lilian Weng’}]

Update sytem message

template = """A article will be passed to you. Extract from it all papers that are mentioned by this article. 

Do not extract the name of the article itself. If no papers are mentioned that's fine - you don't need to extract any! Just return an empty list.

Do not make up or guess ANY extra information. Only extract what exactly is in the text."""

prompt = ChatPromptTemplate.from_messages([
    ("system", template),
    ("human", "{input}")
])

Set up chain

extraction_chain = prompt | extraction_model | JsonKeyOutputFunctionsParser(key_name="papers")

Invoke chain

extraction_chain.invoke({"input": page_content})
  • [{‘title’: ‘Chain of thought (CoT; Wei et al. 2022)’, ‘author’: ‘Wei et al.’}, {‘title’: ‘Tree of Thoughts (Yao et al. 2023)’, ‘author’: ‘Yao et al.’}, {‘title’: ‘LLM+P (Liu et al. 2023)’, ‘author’: ‘Liu et al.’}, {‘title’: ‘ReAct (Yao et al. 2023)’, ‘author’: ‘Yao et al.’}, {‘title’: ‘Reflexion (Shinn & Labash 2023)’, ‘author’: ‘Shinn & Labash’}, {‘title’: ‘Chain of Hindsight (CoH; Liu et al. 2023)’, ‘author’: ‘Liu et al.’}, {‘title’: ‘Algorithm Distillation (AD; Laskin et al. 2023)’, ‘author’: ‘Laskin et al.’}]

Test chain

extraction_chain.invoke({"input": "hi"})
  • []

Extraction for the complete blog post

Split the text

text_splitter = RecursiveCharacterTextSplitter(chunk_overlap=0)
splits = text_splitter.split_text(doc.page_content)
len(splits)
  • 14

Create function to join the lists

def flatten(matrix):
    flat_list = []
    for row in matrix:
        flat_list += row
    return flat_list
  • Test the function
flatten([[1, 2], [3, 4]])
  • [1, 2, 3, 4]

Take a look at the splits

  • The splits are just text.
  • We need to convert them to a dictionary where the text is the input key.
print(splits[0])

Use RunnableLambda to create function

prep = RunnableLambda(
    lambda x: [{"input": doc} for doc in text_splitter.split_text(x)]
)
  • Test function
prep.invoke("hi")
  • [{‘input’: ‘hi’}]

Create chain

chain = prep | extraction_chain.map() | flatten
  • extraction_chain operates over a single element

  • However, we have a list of elements

  • Therefore, we call .map()

Invoke chain

chain.invoke(doc.page_content)
[{'title': 'AutoGPT', 'author': ''},
 {'title': 'GPT-Engineer', 'author': ''},
 {'title': 'BabyAGI', 'author': ''},
 {'title': 'Chain of thought (CoT; Wei et al. 2022)', 'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'LLM+P (Liu et al. 2023)', 'author': 'Liu et al.'},
 {'title': 'ReAct (Yao et al. 2023)', 'author': 'Yao et al.'},
 {'title': 'Reflexion (Shinn & Labash 2023)', 'author': 'Shinn & Labash'},
 {'title': 'Reflexion framework', 'author': 'Shinn & Labash'},
 {'title': 'Chain of Hindsight', 'author': 'Liu et al. 2023'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'Algorithm Distillation', 'author': 'Laskin et al. 2023'},
 {'title': 'ED (expert distillation)', 'author': ''},
 {'title': 'RL^2', 'author': 'Duan et al. 2017'},
 {'title': 'LSH: Locality-Sensitive Hashing', 'author': ''},
 {'title': 'ANNOY: Approximate Nearest Neighbors Oh Yeah', 'author': ''},
 {'title': 'HNSW: Hierarchical Navigable Small World', 'author': ''},
 {'title': 'FAISS: Facebook AI Similarity Search', 'author': ''},
 {'title': 'ScaNN: Scalable Nearest Neighbors', 'author': ''},
 {'title': 'MRKL: Modular Reasoning, Knowledge and Language',
  'author': 'Karpas et al. 2022'},
 {'title': 'TALM: Tool Augmented Language Models',
  'author': 'Parisi et al. 2022'},
 {'title': 'Toolformer', 'author': 'Schick et al. 2023'},
 {'title': 'HuggingGPT', 'author': 'Shen et al. 2023'},
 {'title': 'API-Bank', 'author': 'Li et al. 2023'},
 {'title': 'ChemCrow', 'author': 'Bran et al. 2023'},
 {'title': 'Boiko et al. (2023)', 'author': 'Boiko et al.'},
 {'title': 'Generative Agents Simulation', 'author': 'Park, et al. 2023'},
 {'title': 'Park et al. 2023', 'author': ''},
 {'title': 'Super Mario: How Nintendo Conquered America',
  'author': 'Jeff Ryan'},
 {'title': 'Model-View-Controller (MVC) Explained', 'author': 'Techopedia'},
 {'title': 'Python Game Development: Creating a Snake Game',
  'author': 'Real Python'},
 {'title': 'Paper A', 'author': 'Author A'},
 {'title': 'Paper B', 'author': 'Author B'},
 {'title': 'Paper C', 'author': 'Author C'},
 {'title': 'Chain of thought prompting elicits reasoning in large language models',
  'author': 'Wei et al.'},
 {'title': 'Tree of Thoughts: Deliberate Problem Solving with Large Language Models',
  'author': 'Yao et al.'},
 {'title': 'Chain of Hindsight Aligns Language Models with Feedback',
  'author': 'Liu et al.'},
 {'title': 'LLM+P: Empowering Large Language Models with Optimal Planning Proficiency',
  'author': 'Liu et al.'},
 {'title': 'ReAct: Synergizing reasoning and acting in language models',
  'author': 'Yao et al.'},
 {'title': 'Reflexion: an autonomous agent with dynamic memory and self-reflection',
  'author': 'Shinn & Labash'},
 {'title': 'In-context Reinforcement Learning with Algorithm Distillation',
  'author': 'Laskin et al.'},
 {'title': 'MRKL Systems A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning',
  'author': 'Karpas et al.'},
 {'title': 'API-Bank: A Benchmark for Tool-Augmented LLMs',
  'author': 'Li et al.'},
 {'title': 'HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace',
  'author': 'Shen et al.'},
 {'title': 'ChemCrow: Augmenting large-language models with chemistry tools',
  'author': 'Bran et al.'},
 {'title': 'Emergent autonomous scientific research capabilities of large language models',
  'author': 'Boiko et al.'},
 {'title': 'Generative Agents: Interactive Simulacra of Human Behavior',
  'author': 'Joon Sung Park, et al.'}]

Acknowledgments

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website