Navigating Hugging Face Transformers: Your Guide to Open Source 🤗 Models, Datasets, Spaces, and More
Unlocking the Power of Open Source AI with Hugging Face Transformers
Open-source AI and Hugging Face have become nearly synonymous. With open-source software, we can seamlessly integrate state-of-the-art AI models, such as large language models (LLMs), image recognition, and text-to-speech, to rapidly build innovative applications. Hugging Face revolutionizes this process by making open-source AI models and datasets easily accessible.
In this post, we will delve into the Hugging Face Transformers library, learn how to navigate the plethora of available models and datasets, explore apps built by fellow developers, and get started building our own multimodal AI application, all with just a few lines of code using Hugging Face Transformers.
A Primer on 🤗 Transformers
🤗 Transformers: Revolutionizing Machine Learning
🤗 (Hugging Face) Transformers is a library that has revolutionized the way we approach machine learning and artificial intelligence tasks across various domains. At its core, Transformers provides a comprehensive suite of APIs and tools designed to facilitate the easy download, training, and deployment of state-of-the-art pretrained models. This library stands out for its ability to significantly reduce compute costs, carbon footprint, and the time and resources required for training models from scratch, making advanced machine learning more accessible and efficient. 🚀
⚡ The Power of Transformer Models
The term "transformers" in 🤗 Transformers refers to the foundational concept of transformer models in natural language processing (NLP). Transformers are a deep learning model architecture that has gained immense popularity due to its ability to handle sequential data efficiently. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers use self-attention mechanisms to weigh the significance of different input elements when making predictions. This attention mechanism allows transformers to capture long-range dependencies in data more effectively, making them particularly well-suited for tasks like language translation and text generation.
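To make the self-attention idea concrete, below is a minimal illustrative sketch in PyTorch. It is my own simplification: real transformer layers add learned query/key/value projections and multiple attention heads.

import torch
import torch.nn.functional as F

def self_attention(x):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V,
    # with Q, K and V all taken to be the input itself.
    d_k = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_k**0.5  # pairwise similarity between positions
    weights = F.softmax(scores, dim=-1)          # each position's attention over the sequence
    return weights @ x                           # weighted mix of all positions

x = torch.randn(1, 4, 8)        # toy batch: 4 tokens with 8-dimensional embeddings
print(self_attention(x).shape)  # torch.Size([1, 4, 8])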

🛠️ Versatility Across Domains
One of the key advantages of using 🤗 Transformers is its support for a wide range of tasks across different modalities. The 🤗 Models page showcases an extensive collection of models. For NLP, it offers capabilities for text classification, named entity recognition, question answering, language modeling, summarization, translation, and more. In the realm of Computer Vision, it supports image classification, object detection, and segmentation, to name a few. For audio processing, it provides tools for automatic speech recognition and audio classification. Moreover, it excels in multimodal tasks such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
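To see that breadth in action, here is a small sketch: the same pipeline API covers text, vision, and audio. The task names are standard pipeline aliases; omitting the model argument makes pipeline() fall back to a default checkpoint for each task, downloaded on first use.

from transformers import pipeline

# One API, many modalities
summarizer = pipeline("summarization")
image_classifier = pipeline("image-classification")
transcriber = pipeline("automatic-speech-recognition")

print(summarizer("Hugging Face Transformers provides thousands of "
                 "pretrained models for text, vision, and audio tasks."))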
🤝 Interoperability and Community
The library promotes framework interoperability, supporting PyTorch, TensorFlow, and JAX. This flexibility allows users to train a model in one framework and load it for inference in another, or even export models to formats like ONNX and TorchScript for deployment in production environments. The Hugging Face community, accessible through the Hub, forum, or Discord, provides a platform for collaboration and sharing, further enriching the ecosystem.
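As a small sketch of that interoperability (assuming both PyTorch and TensorFlow are installed), the from_tf flag lets a PyTorch model class load TensorFlow weights from the same Hub repository:

from transformers import AutoModel

# Load a checkpoint as a PyTorch model...
pt_model = AutoModel.from_pretrained("bert-base-uncased")

# ...or load the TensorFlow weights from the same repo into PyTorch
pt_model_from_tf = AutoModel.from_pretrained("bert-base-uncased", from_tf=True)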
📊 Datasets Library: Powering Your Projects
In addition to models, Hugging Face offers the 🤗 Datasets library, a powerful tool for accessing and sharing datasets for audio, computer vision, and NLP tasks. It enables loading a dataset in a single line of code and provides efficient data-processing methods to prepare it for training. Backed by the Apache Arrow format, it can process large datasets with optimal speed and efficiency. The integration with the 🤗 Hub makes it easy to load and share datasets with the machine learning community.
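A minimal sketch of that workflow, using rotten_tomatoes purely as an example dataset with a text column:

from datasets import load_dataset

# Download and cache a dataset in one line
dataset = load_dataset("rotten_tomatoes", split="train")

# Arrow-backed processing: derive a new column efficiently
dataset = dataset.map(lambda example: {"n_chars": len(example["text"])})
print(dataset[0])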
🚀 Hugging Face Spaces: Showcase Your ML Prowess
🤗 Spaces provide a convenient platform to host ML demo apps on your profile or organization's profile, enabling you to build your ML portfolio, showcase projects, and collaborate within the ML community. Users can easily create apps using the Streamlit and Gradio SDKs, deploy custom applications with Docker, or develop static Spaces with JavaScript and HTML. Additionally, Spaces can be upgraded to run on GPU or other accelerated hardware for enhanced performance.
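To give a sense of how little code a Space needs, here is a sketch of a complete single-file Gradio app; committing something like this (plus a requirements.txt) to a Space is enough to get a hosted demo:

# app.py
import gradio as gr

# A trivial demo function - swap in your own model call here
def greet(name):
    return f"Bonjour, {name}!"

gr.Interface(fn=greet, inputs="text", outputs="text").launch()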
The video below shows an example of running an app that's trending on 🤗 Spaces at the time of writing. Explore more apps on 🤗 Spaces!
Together, 🤗 Transformers, Datasets, and Spaces form a robust foundation for developing and deploying machine learning models. Whether you are a beginner looking to get started with machine learning or an experienced practitioner seeking to leverage the latest advancements, Hugging Face offers the tools and community support to accelerate your projects and achieve remarkable results. 🚀
Navigating the Treasure Trove of 🤗 Models and Datasets
At the time of writing, there are a staggering 552,137 models available on the 🤗 Models page. With such an overwhelming number of options, where do we even begin?
Well, the first step is to think about the specific task you want to tackle for your application and select the corresponding task from the left side panel. Additionally, you can further refine your search by specifying language support, license type preferences, etc.
In the video below, I demonstrate how to narrow down the model selection: first select the "Text-to-Speech" task, then choose "French" as the desired language, and finally filter for models with the permissive MIT license. Notice how the number of matching models drops to a manageable nine. Conveniently, we can also sort the results by criteria such as trending popularity, number of likes, downloads, creation date, or latest update.
This intuitive filtering system provided by the 🤗 Hub empowers you to quickly identify the most relevant models for your specific needs, saving you valuable time and effort.
Once you have identified a promising model from the search results, clicking on it will take you to its dedicated "Model card" page. This page serves as a comprehensive hub, often containing detailed information about the model: its capabilities, variants, limitations, example code snippets, scripts, and essential notes on ethical usage considerations.
As demonstrated in the video below, you can switch to the "Files and versions" tab to check the size of the pytorch_model.bin file. This file size provides an estimate of the memory requirements for running the model, typically around 1.2 times the file size; a 4 GB checkpoint, for example, would need roughly 4.8 GB of memory. With this information, you can choose an appropriate model or plan your hardware resources accordingly.
Back on the "Model card" page, you will find the "Use in Transformers" section, which offers code snippets to easily load the model checkpoint from the 🤗 Hub. A checkpoint refers to the saved model, including its weights and configurations, encapsulating the trained state of the model.
The provided code snippets offer two options: an approach using the model class directly, or a more user-friendly approach leveraging the Pipeline object. The latter provides a high-level abstraction for various tasks, automatically handling necessary preprocessing and post-processing steps, such as tokenization or encoding of text data, to ensure compatibility with the model's input format.
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-to-speech", model="suno/bark")
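Calling the resulting pipe is then a one-liner. As a rough sketch (the exact output format may vary between library versions, but the text-to-speech pipeline returns the raw waveform together with its sampling rate):

output = pipe("Bonjour, comment allez-vous ?")
print(output["sampling_rate"])  # e.g. 24000 for Bark
print(output["audio"].shape)    # the raw waveform samples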
Similarly, you can filter, find, and use relevant 🤗 Datasets for your tasks.
from datasets import load_dataset
# xlsum is organized into per-language configurations, so we name one
dataset = load_dataset("csebuetnlp/xlsum", "french")
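You can then inspect what was loaded; the column names are dataset-specific (for xlsum they include the article text and its reference summary):

print(dataset)              # available splits and sizes
print(dataset["train"][0])  # first training example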
Let's build something together now!
🚀 Building a Multimodal AI Language App with Hugging Face Transformers
We will chain together two powerful models to build an app that will help us improve our French language skills.
You can run the code in your local environment or on Google Colab.
https://github.com/kmjawadurrahman/huggingface_transformers_translate_tts
🌐 The Translator Component
First, we need to install the transformers and torch Python packages in a virtual environment:
pip install transformers
pip install torch
Here's the code for the translator component:
import torch
from transformers import pipeline

translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16)

english_text = "What is your name?"
translated_french_text = translator(english_text,
                                    src_lang="eng_Latn",
                                    tgt_lang="fra_Latn")
print(translated_french_text[0]["translation_text"])
Let's break down the code:
- We import the necessary libraries: torch for tensor operations and pipeline from the transformers library.
- We load the translation model using pipeline("translation", ...). The task="translation" parameter specifies that we want to use the translation task. The model parameter specifies the pretrained model to use, in this case "facebook/nllb-200-distilled-600M", a multilingual translation model from Facebook (now Meta). The bfloat16 data type strikes a balance between accuracy and computational efficiency.
- We call the translator object with the english_text as input, specifying the source language (src_lang="eng_Latn") and target language (tgt_lang="fra_Latn").
The output we get is the correct French translation "Comment vous appelez-vous ?".
Note: It will take a while to load a model for the first time. The model gets cached, so subsequent loads are faster. The cache lives at ~/.cache/huggingface/hub on Linux and macOS, and at C:\Users\username\.cache\huggingface\hub on Windows. If you are running the code in your local environment, you may want to delete unused models to free up your storage.
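One convenient way to do that is the cache CLI that ships with the huggingface_hub package (installed alongside transformers):

huggingface-cli scan-cache    # list cached models and their sizes
huggingface-cli delete-cache  # interactively pick revisions to delete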
🎙️ Adding a Voice to the Text
We can leverage 🤗 Transformers once again to add text-to-speech. However, this time, instead of using the high-level pipeline helper, we will load the model and processor directly. Here's the code:
import torch
from transformers import BarkModel, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")
model = model.to(device)

inputs = processor("Comment vous appelez-vous ?", voice_preset="fr_speaker_3")
speech_output = model.generate(**inputs.to(device))
print(speech_output)
- We import the necessary modules from the transformers library: BarkModel for the text-to-speech model, and AutoProcessor for processing the input text.
- We check if a GPU is available using torch.cuda.is_available() and set the device accordingly ("cuda:0" for GPU or "cpu" for CPU).
- We load the text-to-speech processor and model using AutoProcessor.from_pretrained("suno/bark") and BarkModel.from_pretrained("suno/bark"), respectively. The "suno/bark" model is a pretrained text-to-speech model from Suno, hosted on the Hugging Face Hub.
- We move the model to the appropriate device using model.to(device).
- We prepare the input text and voice preset using processor("Comment vous appelez-vous ?", voice_preset="fr_speaker_3"). The voice_preset parameter specifies the voice we want to use for the text-to-speech output. In this case, we're using the "fr_speaker_3" preset, which is a French voice.
- We generate the speech output by calling model.generate(**inputs.to(device)), passing the processed input to the model.
- Finally, we print the speech_output, which will contain the audio data for the generated speech.
The speech output is a tensor of raw audio samples, which are numerical values representing the waveform of the audio signal.
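To actually listen to it, one simple option (assuming scipy is installed) is to write the samples to a WAV file, using the sampling rate stored in the model's generation config:

import scipy.io.wavfile as wavfile

sampling_rate = model.generation_config.sample_rate  # Bark's output rate
wavfile.write("bark_output.wav", rate=sampling_rate,
              data=speech_output[0].cpu().numpy())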
🧩 Building the Language App with Gradio
Now that we have the translator and text-to-speech components ready, let's bring them together to build our language learning app using Gradio. First, we need to install two more required packages:
pip install gradio
pip install gradio_client
Here's the code that combines the two components and creates an interactive app with Gradio:
# app.py
import gradio as gr
import torch
from transformers import pipeline, BarkModel, AutoProcessor

# Set the device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the translation model
translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16)

# Load the text-to-speech processor and model
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

# Get the sampling rate for the audio output
sampling_rate = model.generation_config.sample_rate

# Move the model to the appropriate device
model = model.to(device)

# Function to handle the translation and text-to-speech
def launch(input_text):
    # Translate the input text to French
    translated_french_text = translator(input_text,
                                        src_lang="eng_Latn",
                                        tgt_lang="fra_Latn")
    # Prepare the input for the text-to-speech model
    speech_input = processor(translated_french_text[0]["translation_text"],
                             voice_preset="fr_speaker_3")
    # Generate the speech output
    speech_output_tensor = model.generate(**speech_input.to(device))
    # Convert the speech output tensor to a numpy array
    speech_output = speech_output_tensor[0].cpu().numpy()
    # Return the translated text and the audio output
    return translated_french_text[0]["translation_text"], (sampling_rate, speech_output)

# Create the Gradio interface
interface = gr.Interface(launch,
                         inputs="text",
                         outputs=["text", gr.Audio()])

# Launch the app
interface.launch()
- We import the necessary libraries: gradio for creating the interactive app, torch for tensor operations, and pipeline, BarkModel, and AutoProcessor from the transformers library.
- We set the device based on GPU availability.
- We load the translation model using pipeline, and the text-to-speech processor and model using AutoProcessor and BarkModel, respectively.
- We get the sampling rate for the audio output from the text-to-speech model's configuration.
- We move the text-to-speech model to the appropriate device.
- We define a function launch that takes the input text, translates it to French using the translator component, generates speech output using the text-to-speech component, and returns the translated text and audio output.
- We create a Gradio interface using gr.Interface, specifying the launch function as the main function, the input as "text", and the outputs as "text" (for the translated text) and gr.Audio() (for the audio output).
- Finally, we launch the app using interface.launch().
Running this app.py file with python app.py will make the application available in your web browser at http://127.0.0.1:7860/
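If you want to share the demo beyond your machine, Gradio can also generate a temporary public link:

interface.launch(share=True)  # prints a temporary public URL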
Here is the app in action:
Note: The app takes a long time to run because I am using a MacBook. It should run much faster on Google Colab (GPU enabled) or on a Windows machine with a decent Nvidia GPU.
Conclusion
In this introductory blog post, we explored the world of Hugging Face Transformers, covering a primer on the library, navigating its vast collection of models and datasets, and building a multimodal AI language app using translation and text-to-speech models. In the next post, we will learn how to deploy the app using Gradio and the 🤗 Hub, exposing it as an API for seamless integration.