Hugging Face Hub 🤗

Introduction

Jan Kirenz

Introduction

Hugging FAce is a data science and community platform that provides tools to easily build, train and deploy ML models

What is the Hub?

The Hub is a free platform where anyone can share and explore models, datasets, and ML demos.

Hugging Face

Over 300,000 public models.
Models for Natural Language Processing, Computer Vision, Audio/Speech, and Reinforcement Learning
Models for over 180 languages.

Learning Goals

Learning Goals

Explore models shared on the Hub.
Find suitable models and datasets for your task.
How to contribute and work collaboratively.
Explore ML demos created by the community.

Models

Exploring a model

You can access over 300,000 models at hf.co/models.
You will see gpt2 as one of the models with the most downloads. Let’s click on it.
The website will take you to the model card when you click a model.

What is the Model Card?

A model card is a tool that
- documents models,
- provides helpful information about the models
- is essential for discoverability and reproducibility

Inference API

At the right column, you can play with the model directly in the browser using the Inference API.
GPT2 is a text generation model, so it will generate additional text given an initial input.
Try typing something like, “It was a bright and sunny day.”

Model Card Content

In the middle, you can go through the model card content.
It has sections such as Intended uses & limitations, Training procedure, and Citation Info.

Where does the data come from?

At Hugging Face, everything is based in Git repositories and is open-sourced.
You can click the “Files and Versions” tab, which will allow you to see all the repository files, including the model weights.
The model card is a markdown file (README.md) which on top of the content contains metadata such as the tags.
Just as with GitHub, you can do things such as Git cloning, adding, committing, branching, and pushing.

Take a look at config.json

Open the config.json file of the GPT2 repository.
The config file contains hyperparameters as well as useful information for loading the model.

Exploring Models

Filter

At the left of https://huggingface.co/models, you can filter for different things:
- Tasks: Computer Vision, Natural Language Processing, Audio, and more.
- Libraries: You can find models of Keras, PyTorch, spaCy, allenNLP, and more.
- Datasets: The Hub also hosts thousands of datasets, as you’ll find more about later.
- Languages: Many of the models on the Hub are NLP-related. You can find models for hundreds of languages.

Adding a model

Learn how to upload a model to the Hub.

Workflow

Go to huggingface.co/new to create a new model repository.
You start with a public repo that has a model card.
You can upload your model either by using the Web UI or by doing it with Git.

Wub UI: You can click Add File and drag and drop the files you want to add.

Note

Take a look at the appendix to learn how to use Git

Datasets

Basics

The Hub hosts around 3000 datasets that are open-sourced and free to use in multiple domains.
On top of it, the open-source datasets library allows the easy use of these datasets
Similar to models, you can head to https://hf.co/datasets. At the left, you can find different filters based on the task, license, and size of the dataset.

GLUE datset

Let’s explore the GLUE dataset, which is a famous dataset used to test the performance of NLP models.
Similar to model repositories, you have a dataset card that documents the dataset. If you scroll down a bit, you will find things such as the summary, the structure, and more.

Dataset slice

At the top, you can explore a slice of the dataset directly in the browser.
The GLUE dataset is divided into multiple sub-datasets (or subsets) that you can select, such as COLA and QNLI.

Models trained on the dataset

At the right of the dataset card, you can see a list of models trained on this dataset.

ML Demos

How to create an interactive, publicly available demo

Basics

Demos of models are an increasingly important part of the ecosystem.

Demos allow:

model developers to easily present their work to a wide audience
to increase reproducibility in machine learning by lowering the barrier to test a model
to share with a non-technical audience the impact of a model

Frameworks

There are Open-Source Python frameworks such as Gradio and Streamlit that allow building these demos very easily,
Tools such as Hugging Face Spaces allow to host and share them.

What’s next?

Congratulations! You have completed this tutorial 👍

Next, you may want to go back to the lab’s website

Acknowledgments: The slides are mainly based on a toolkit provided by HuggingFace

Appendix: Git approach

Git approach

If you want to understand the complete workflow how to upload models, let’s go with the Git approach.

Install both git and git-lfs on your system.

Git: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git
Git-lfs: https://git-lfs.github.com/. Large files need to be uploaded with Git LFS. Git does not work well once your files are above a few megabytes, which is frequent in ML. ML models can be up to gigabytes or terabytes!

Clone the repository you just created

git clone https://huggingface.co/<your-username>/<your-model-id>

Go to the directory and initialize Git LFS

Git approach (optional)

HuggingFace already provides a list of common file extensions for the large files in .gitattributes

If the files you want to upload are not included in the .gitattributes file, you might need as shown here: You can do so with:

git lfs track "*.your_extension"

git lfs install

Git approach

Add your files to the repository. The files depend on the framework/libraries you’re using. Overall, what is important is that you provide all artifacts required to load the model. For example:

For TensorFlow, you might want to upload a SavedModel or h5 file.
For PyTorch, usually, it’s a pytorch_model.bin.
For Scikit-Learn, it’s usually a joblib file.

Here is an example in Python saving a Scikit-Learn model file.

from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2]
from joblib import dump, load
dump(reg, 'model.joblib')

Git approach

Commit and push your files. (make sure the saved file is within the repository). Use GitHub Desktop or:

git add .
git commit -m "First model version"
git push

And we’re done! You can check your repository with all the recently added files!

The UI allows you to explore the model files and commits and to see the diff introduced by each commit.

Hugging Face Hub 🤗

Introduction

What is the Hub?

Learning Goals

Models

Exploring a model

What is the Model Card?

Tags

Inference API

Model Card Content

Where does the data come from?

Take a look at config.json

Exploring Models

Filter

Adding a model

Workflow

Share and collaboarate

Datasets

Basics

GLUE datset

Dataset slice

Models trained on the dataset

ML Demos

Basics

Frameworks

What’s next?

Appendix: Git approach

Git approach

Git approach (optional)

Git approach

Git approach