Introduction
Hugging Face is a data science and community platform that provides tools to easily build, train, and deploy ML models.
Hugging Face
Learning Goals
You can access over 300,000 models at hf.co/models.
You will see gpt2 as one of the models with the most downloads. Let’s click on it.
The website will take you to the model card when you click a model.
In the right column, you can play with the model directly in the browser using the Inference API.
GPT2 is a text generation model, so it will generate additional text given an initial input.
Try typing something like, “It was a bright and sunny day.”
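The same kind of text generation can also be tried locally. The following is a minimal sketch, assuming the `transformers` library (with a backend such as PyTorch) is installed and the `gpt2` checkpoint can be downloaded from the Hub. The function name `generate_continuation` and the default token limit are illustrative choices, not part of the original tutorial.

```python
def generate_continuation(prompt, max_new_tokens=20):
    """Return the prompt extended by GPT-2 (downloads the model on first use)."""
    # Imported inside the function so that defining it needs no third-party
    # packages; calling it requires `transformers` and network access.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    # The pipeline returns a list of dicts, one per generated sequence.
    return outputs[0]["generated_text"]
```

Calling `generate_continuation("It was a bright and sunny day.")` should return the prompt followed by generated text. The continuation may differ between runs.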
In the middle, you can go through the model card content.
It has sections such as Intended uses & limitations, Training procedure, and Citation Info.
At Hugging Face, everything is based on Git repositories and is open source.
You can click the “Files and Versions” tab, which will allow you to see all the repository files, including the model weights.
The model card is a Markdown file (README.md) that, on top of the content, contains metadata such as tags.
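To make the split between metadata and content concrete, here is a small sketch that separates the two parts of a model card. It assumes the metadata is a YAML block fenced by `---` lines at the very top of README.md. The helper name `split_model_card` and the card contents are hypothetical and only serve as an illustration.

```python
def split_model_card(text):
    """Split a model card into (metadata_block, body).

    Assumes the metadata is fenced by lines containing only '---' at the
    very top of the file; returns ("", text) when no such block is found.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            metadata = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return metadata, body
    return "", text  # opening fence without a closing one


# Hypothetical card contents, for illustration only.
example_card = """---
language: en
tags:
- text-generation
license: mit
---
# GPT-2

Model description goes here.
"""

metadata, body = split_model_card(example_card)
print(metadata)
```

The metadata block is what the Hub reads to attach tags to the model, while the body is rendered as the visible model card.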
Just as with GitHub, you can do things such as Git cloning, adding, committing, branching, and pushing.
Open the config.json file of the GPT2 repository.
The config file contains hyperparameters as well as useful information for loading the model.
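As a rough illustration of what such a file holds, the sketch below parses a short excerpt written in the style of a GPT-2 config.json. The excerpt is an assumption for demonstration purposes: the real file contains more keys, and the repository itself is the authoritative source for the values.

```python
import json

# Illustrative excerpt in the style of a GPT-2 config.json; the real file
# contains more keys, so check the repository for the authoritative values.
config_text = """
{
  "model_type": "gpt2",
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "n_positions": 1024,
  "vocab_size": 50257
}
"""

config = json.loads(config_text)

# Each attention head works on an equal slice of the embedding dimension.
head_dim = config["n_embd"] // config["n_head"]
print(f"{config['n_layer']} layers, {config['n_head']} heads of size {head_dim}")
```

Libraries that load the model read these hyperparameters to rebuild the architecture before the weights are loaded into it.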
On the left of https://huggingface.co/models, you can filter for different things:
Tasks: Computer Vision, Natural Language Processing, Audio, and more.
Libraries: You can find models for Keras, PyTorch, spaCy, AllenNLP, and more.
Datasets: The Hub also hosts thousands of datasets, as you’ll learn more about later.
Languages: Many of the models on the Hub are NLP-related. You can find models for hundreds of languages.
Learn how to upload a model to the Hub.
Go to huggingface.co/new to create a new model repository.
You start with a public repo that has a model card.
You can upload your model either by using the Web UI or by doing it with Git.
Note
Take a look at the appendix to learn how to use Git.
Now that the model is in the Hub, others can find it!
You can also collaborate with others easily by creating an organization.
Hosting through the Hub allows a team to update repositories and do things you might be used to, such as working in branches and working collaboratively.
The Hub also enables versioning in your models: if a model checkpoint is suddenly broken, you can always head back to a previous version.
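Going back to a previous version can be sketched with the `huggingface_hub` library, assuming it is installed and the Hub is reachable. The `revision` argument accepts a branch name, a tag, or a commit hash. The helper name `download_config` and the default values are illustrative.

```python
def download_config(repo_id="gpt2", revision="main"):
    """Download config.json from a given revision and return its local path."""
    # Imported inside the function so that defining it needs no third-party
    # packages; calling it requires `huggingface_hub` and network access.
    from huggingface_hub import hf_hub_download

    # Passing a commit hash as `revision` pins the download to that exact
    # state of the repository, even if later commits change the file.
    return hf_hub_download(repo_id=repo_id, filename="config.json", revision=revision)
```

Replacing `"main"` with the hash of an earlier commit (visible in the repository's commit history) retrieves the file as it was at that point.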
The Hub hosts around 3,000 datasets that are open-sourced and free to use in multiple domains.
On top of that, the open-source datasets library makes these datasets easy to use.
Similar to models, you can head to https://hf.co/datasets. On the left, you can find different filters based on the task, license, and size of the dataset.
Let’s explore the GLUE dataset, which is a famous dataset used to test the performance of NLP models.
Similar to model repositories, you have a dataset card that documents the dataset. If you scroll down a bit, you will find things such as the summary, the structure, and more.
At the top, you can explore a slice of the dataset directly in the browser.
The GLUE dataset is divided into multiple sub-datasets (or subsets) that you can select, such as CoLA and QNLI.
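Selecting a subset works similarly in code. Below is a minimal sketch, assuming the `datasets` library is installed and that the dataset is reachable under the identifier `"glue"` with the subset name `"cola"`. Both names are assumptions here, and the identifier on the Hub may carry an organization prefix. The helper name `load_glue_subset` is hypothetical.

```python
def load_glue_subset(subset="cola", split="train"):
    """Load one GLUE subset (for example CoLA) for the requested split."""
    # Imported inside the function so that defining it needs no third-party
    # packages; calling it requires `datasets` and network access.
    from datasets import load_dataset

    # The second positional argument selects the subset (configuration).
    return load_dataset("glue", subset, split=split)
```

Indexing the returned dataset, for example `load_glue_subset()[0]`, gives a single example as a dictionary, much like one row in the browser preview.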
How to create an interactive, publicly available demo
Demos allow:
Congratulations! You have completed this tutorial 👍
Next, you may want to go back to the lab’s website.
Acknowledgments: The slides are mainly based on a toolkit provided by Hugging Face.
If you want to understand the complete workflow for uploading models, let’s go with the Git approach.
Hugging Face already provides a list of common file extensions for large files in .gitattributes.
If the files you want to upload are not covered by the .gitattributes file, you might need to track their extension with Git LFS yourself.
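The usual way to track a new extension is the Git LFS command line (`git lfs track "*.joblib"`), which appends a rule to .gitattributes. The sketch below writes the same kind of rule from Python, assuming the repository uses Git LFS for large files. The helper name `track_extension` is hypothetical, and a temporary directory stands in for a cloned repository.

```python
import tempfile
from pathlib import Path


def track_extension(repo_dir, extension):
    """Add a Git LFS rule for `extension` to .gitattributes if it is missing.

    Returns True when a rule was appended, False when one already existed.
    """
    attributes = Path(repo_dir) / ".gitattributes"
    text = attributes.read_text() if attributes.exists() else ""
    pattern = f"*.{extension}"
    # A rule already exists if some line starts with the same pattern.
    if any(line.split()[:1] == [pattern] for line in text.splitlines()):
        return False
    # Start on a fresh line in case the file lacks a trailing newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    rule = f"{pattern} filter=lfs diff=lfs merge=lfs -text\n"
    attributes.write_text(text + separator + rule)
    return True


# A temporary directory stands in for a cloned model repository.
with tempfile.TemporaryDirectory() as repo_dir:
    first = track_extension(repo_dir, "joblib")
    second = track_extension(repo_dir, "joblib")
    content = (Path(repo_dir) / ".gitattributes").read_text()

print(content)
```

The second call leaves the file unchanged, so the helper can be run repeatedly without duplicating rules.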
Common model file formats include an h5 file, pytorch_model.bin, and a joblib file.
Here is an example in Python saving a Scikit-Learn model file.
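Saving a Scikit-Learn model as a joblib file could look like the following sketch. It assumes `scikit-learn` and `joblib` are installed. The choice of the Iris data, the logistic regression model, and the file name `model.joblib` are illustrative and not taken from the original slides.

```python
from pathlib import Path


def save_sklearn_model(path="model.joblib"):
    """Train a small classifier and save it as a joblib file."""
    # Imported inside the function so that defining it needs no third-party
    # packages; calling it requires `scikit-learn` and `joblib`.
    import joblib
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Fit a simple model on a small built-in dataset.
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=200).fit(X, y)

    # Serialize the fitted model to disk; this is the file to add to the repo.
    joblib.dump(model, path)
    return Path(path)
```

After calling `save_sklearn_model()` inside the cloned repository, the resulting file can be added, committed, and pushed like any other file.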
And we’re done! You can check your repository with all the recently added files!
The UI allows you to explore the model files and commits and to see the diff introduced by each commit.
Jan Kirenz