HuggingFace
The pipeline() function from the transformers library can be used to run inference with models from the Hugging Face Hub.

Deep learning is currently undergoing a period of rapid progress across a wide variety of domains, including:
📖 Natural language processing
👀 Computer vision
🔊 Audio
and many more!
The main driver of these breakthroughs is the Transformer – a novel neural network architecture introduced by Google researchers in 2017.
💻 They can generate code as in products like GitHub Copilot, which is based on OpenAI’s family of GPT models.
❓ They can be used to improve search engines, like Google did with a Transformer called BERT.
🗣️ They can process speech in multiple languages to perform speech recognition, speech translation, and language identification. For example, Facebook's XLS-R model can automatically transcribe speech and even translate it from one language into another!
Training Transformer models from scratch requires a lot of resources: compute, data, and days of training time.
With transfer learning, it is possible to adapt a model that has been trained from scratch (usually called a pretrained model) to a new but similar task.
Fine-tuning is a special case of transfer learning where you use new data to continue training the model on the new task.
The models that we'll be looking at in this tutorial are all examples of fine-tuned models.
You can learn more about the transfer learning process in the video below:
The Hugging Face Transformers library provides a unified API across dozens of Transformer architectures, as well as the means to train models and run inference with them.
The fastest way to learn what Transformers can do is via the pipeline() function.
This function loads a model from the Hugging Face Hub and takes care of all the preprocessing and postprocessing steps that are needed to convert inputs into predictions:

Import the pipeline:
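The import is a single line (the code snippets in this section are minimal sketches based on the standard transformers API):

```python
from transformers import pipeline
```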
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""Let’s start with one of the most common tasks in NLP: text classification
Now suppose that we’d like to predict the sentiment of this text, i.e. whether the feedback is positive or negative.
This is a special type of text classification that is often used in industry to aggregate customer feedback across products or services.

We can do this with the pipeline() function as shown below. When you run this code, you'll see a message about which Hub model is being used by default.
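A minimal sketch of this step, assuming the default model since we pass no model name:

```python
# Build a text-classification pipeline; without a model argument,
# the default Hub model for this task is downloaded and cached
classifier = pipeline("text-classification")
outputs = classifier(text)
print(outputs)
```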
In this case, the pipeline() function loads the distilbert-base-uncased-finetuned-sst-2-english model, a distilled BERT variant fine-tuned on SST-2, a sentiment analysis dataset.
Note
💡 The first time you execute the code, the model will be automatically downloaded from the Hub and cached for later use!
Output: [{'label': 'NEGATIVE', 'score': 0.9015464186668396}]
The model predicts negative sentiment with high confidence, which makes sense given that we have a disgruntled customer.
You can also see that the pipeline returns a list of Python dictionaries with the predictions.
We can also pass several texts at the same time, in which case we get one dictionary in the list for each text.
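For example (a sketch with made-up inputs, reusing the classifier from above):

```python
# A list of inputs returns one dictionary per text
classifier(["I love this product!", "This was a terrible experience."])
# -> [{'label': ..., 'score': ...}, {'label': ..., 'score': ...}]
```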
Instead of just finding the overall sentiment, let’s see if we can extract entities such as organizations, locations, or individuals from the text.
This task is called named entity recognition, or NER for short.

We just load a pipeline for NER without specifying a model.
This will load a default BERT model that has been trained on the CoNLL-2003 dataset:
When we pass our text through the model, we now get a long list of Python dictionaries, where each dictionary corresponds to one detected entity.
Since multiple tokens can correspond to a single entity, we can apply an aggregation strategy that merges entities if the same class appears in consecutive tokens:
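A sketch of this step, using the pipeline's aggregation_strategy argument:

```python
# NER pipeline with the default model; "simple" merges consecutive tokens
# of the same entity class into a single entity group
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
print(outputs)
```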
Output: [{'entity_group': 'ORG', 'score': 0.87900954, 'word': 'Amazon', 'start': 5, 'end': 11}, {'entity_group': 'MISC', 'score': 0.9908588, 'word': 'Optimus Prime', 'start': 36, 'end': 49}, {'entity_group': 'LOC', 'score': 0.9997547, 'word': 'Germany', 'start': 90, 'end': 97}, {'entity_group': 'MISC', 'score': 0.55656713, 'word': 'Mega', 'start': 208, 'end': 212}, {'entity_group': 'PER', 'score': 0.5902563, 'word': '##tron', 'start': 212, 'end': 216}, {'entity_group': 'ORG', 'score': 0.6696913, 'word': 'Decept', 'start': 253, 'end': 259}, {'entity_group': 'MISC', 'score': 0.4983487, 'word': '##icons', 'start': 259, 'end': 264}, {'entity_group': 'MISC', 'score': 0.77536064, 'word': 'Megatron', 'start': 350, 'end': 358}, {'entity_group': 'MISC', 'score': 0.987854, 'word': 'Optimus Prime', 'start': 367, 'end': 380}, {'entity_group': 'PER', 'score': 0.81209683, 'word': 'Bumblebee', 'start': 502, 'end': 511}]

It seems that the model found most of the named entities but was confused about "Megatron" and "Decepticons", which are characters in the Transformers franchise.
This is no surprise, since the original dataset probably did not contain many Transformers characters. For this reason it makes sense to further fine-tune a model on your own dataset!
In question answering, the model is given a question and a context and needs to find the answer to the question within the context.
This problem can be rephrased as a classification problem: For each token the model needs to predict whether it is the start or the end of the answer.
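A minimal sketch, assuming the default question-answering model and a made-up question:

```python
# Extractive question answering: the model returns the answer span it finds in the context
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
print(outputs)
```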

Again, this is achieved with the pipeline() function, as sketched above. Next, let's try to summarize the text. Generation is much more computationally demanding, since we usually generate one token at a time and need to run the model several times.
An example of how this process works is shown below:
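A sketch with the default summarization model (the generation settings are illustrative choices):

```python
# Summarize the complaint; generation parameters are passed directly to the pipeline
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=60, clean_up_tokenization_spaces=True)
print(outputs[0]["summary_text"])
```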

Output: Bumblebee ordered an Optimus Prime action figure from your online store in Germany. Unfortunately, when I opened the package, I discovered to my horror that I had been sent an action figure of Megatron instead. As a lifelong enemy of the Decepticons, I hope you can understand my dilemma.

But what if there is no model in the language of my data?
You can still try to translate the text.
The Helsinki NLP team has provided over 1,000 language pair models for translation.
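A sketch of an English-to-German translation, assuming the opus-mt checkpoint for this language pair:

```python
# Translate the complaint from English to German with a Helsinki-NLP model;
# min_length and clean_up_tokenization_spaces are illustrative settings
translator = pipeline("translation_en_to_de", model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]["translation_text"])
```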
Output: Sehr geehrter Amazon, letzte Woche habe ich eine Optimus Prime Action Figur aus Ihrem Online-Shop in Deutschland bestellt. Leider, als ich das Paket öffnete, entdeckte ich zu meinem Entsetzen, dass ich stattdessen eine Action Figur von Megatron geschickt worden war! Als lebenslanger Feind der Decepticons, Ich hoffe, Sie können mein Dilemma verstehen. Um das Problem zu lösen, Ich fordere einen Austausch von Megatron für die Optimus Prime Figur habe ich bestellt. Eingeschlossen sind Kopien meiner Aufzeichnungen über diesen Kauf. Ich erwarte, von Ihnen bald zu hören. Aufrichtig, Bumblebee.

We can see that the text is clearly not perfectly translated, but the core meaning stays the same.
Another application of translation models is data augmentation via backtranslation.
In zero-shot classification the model receives a text and a list of candidate labels and determines which labels are compatible with the text.
Instead of having fixed classes, this allows for flexible classification without any labelled data!
Usually this is a good first baseline!
Let’s have a look at an example:
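A sketch of a call that could produce the output below; the multilingual NLI checkpoint is an assumption, chosen because the example text and labels are German:

```python
# Zero-shot classification: score each candidate label independently (multi_label=True)
zero_shot = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
text_de = "Dieses Tutorial ist großartig! Ich hoffe, dass jemand von Hugging Face meine Hochschule besuchen wird :)"
classes = ["Treffen", "Arbeit", "Digital", "Reisen"]
zero_shot(text_de, classes, multi_label=True)
```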
Output:

{'sequence': 'Dieses Tutorial ist großartig! Ich hoffe, dass jemand von Hugging Face meine Hochschule besuchen wird :)',
 'labels': ['Digital', 'Arbeit', 'Treffen', 'Reisen'],
 'scores': [0.7426563501358032,
  0.6590237021446228,
  0.517701268196106,
  0.011237525381147861]}

Transformers can also be used for domains other than NLP!
There are many more pipelines that you can experiment with:
audio-classification
automatic-speech-recognition
feature-extraction
text-classification
token-classification
question-answering
table-question-answering
visual-question-answering
document-question-answering
fill-mask
summarization
translation
text2text-generation
text-generation
zero-shot-classification
zero-shot-image-classification
zero-shot-audio-classification
conversational
image-classification
image-segmentation
image-to-text
object-detection
zero-shot-object-detection
depth-estimation
video-classification
mask-generation
Another promising area is audio processing (especially speech-to-text).
See for example the wav2vec2 model:
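A sketch of automatic speech recognition with a wav2vec2 checkpoint (the audio file path is a placeholder):

```python
# Transcribe an audio file; the pipeline handles audio loading and feature extraction
asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
outputs = asr("sample.flac")
print(outputs["text"])
```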

Finally, a lot of real-world data is still in the form of tables.
Being able to query tables is very useful and with TAPAS you can do tabular question-answering:
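A sketch of table question answering with a TAPAS checkpoint; the table contents are made up for illustration (TAPAS expects cell values as strings):

```python
import pandas as pd

# A small example table of repositories and their star counts (values as strings)
table = pd.DataFrame({
    "Repository": ["transformers", "datasets", "tokenizers"],
    "Stars": ["36542", "4512", "3934"],
})

tqa = pipeline("table-question-answering", model="google/tapas-base-finetuned-wtq")
outputs = tqa(table=table, query="Which repository has the most stars?")
print(outputs)
```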

Congratulations! You have completed this tutorial 👍
Next, you may want to go back to the lab's website.
The slides are mainly based on a toolkit provided by Hugging Face’s Lewis Tunstall and the book Natural Language Processing with Transformers.
Jan Kirenz