Navigating Hugging Face Transformers: Your Guide to Open Source 🤗 Models, Datasets, Spaces, and More
Unlocking the Power of Open Source AI with Hugging Face Transformers
Open-source AI and Hugging Face have become nearly synonymous. With open-source software, we can seamlessly integrate state-of-the-art AI models, such as large language models (LLMs), image recognition, and text-to-speech, to rapidly build innovative applications. Hugging Face revolutionizes this process by making open-source AI models and datasets easily accessible.
In this post, we will delve into the Hugging Face Transformers library, learn how to navigate the plethora of available models and datasets, explore apps built by fellow developers, and get started building our own multimodal AI application, all with just a few lines of code using Hugging Face Transformers.
A Primer on 🤗 Transformers
🤗 Transformers: Revolutionizing Machine Learning
🤗 (Hugging Face) Transformers is a library that has revolutionized the way we approach machine learning and artificial intelligence tasks across various domains. At its core, Transformers provides a comprehensive suite of APIs and tools designed to facilitate the easy download, training, and deployment of state-of-the-art pretrained models. This library stands out for its ability to significantly reduce compute costs, carbon footprint, and the time and resources required for training models from scratch, making advanced machine learning more accessible and efficient. 🚀
⚡ The Power of Transformer Models
The term "transformers" in 🤗 Transformers refers to the foundational concept of transformer models in natural language processing (NLP). Transformers are a deep learning model architecture that has gained immense popularity due to its ability to handle sequential data efficiently. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), transformers use self-attention mechanisms to weigh the significance of different input elements when making predictions. This attention mechanism allows transformers to capture long-range dependencies in data more effectively, making them particularly well-suited for tasks like language translation and text generation.
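To make the self-attention idea concrete, below is a minimal illustrative sketch in PyTorch. It is my own simplification: real transformer layers add learned query/key/value projections and multiple attention heads.

import torch
import torch.nn.functional as F

def self_attention(x):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V,
    # with Q, K and V all taken to be the input itself.
    d_k = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_k**0.5  # pairwise similarity between positions
    weights = F.softmax(scores, dim=-1)          # each position's attention over the sequence
    return weights @ x                           # weighted mix of all positions

x = torch.randn(1, 4, 8)        # toy batch: 4 tokens with 8-dimensional embeddings
print(self_attention(x).shape)  # torch.Size([1, 4, 8])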

🛠️ Versatility Across Domains
One of the key advantages of using 🤗 Transformers is its support for a wide range of tasks across different modalities. The 🤗 Models page showcases an extensive collection of models. For NLP, it offers capabilities for text classification, named entity recognition, question answering, language modeling, summarization, translation, and more. In the realm of Computer Vision, it supports image classification, object detection, and segmentation, to name a few. For audio processing, it provides tools for automatic speech recognition and audio classification. Moreover, it excels in multimodal tasks such as table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
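To see that breadth in action, here is a small sketch: the same pipeline API covers text, vision, and audio. The task names are standard pipeline aliases; omitting the model argument makes pipeline() fall back to a default checkpoint for each task, downloaded on first use.

from transformers import pipeline

# One API, many modalities
summarizer = pipeline("summarization")
image_classifier = pipeline("image-classification")
transcriber = pipeline("automatic-speech-recognition")

print(summarizer("Hugging Face Transformers provides thousands of "
                 "pretrained models for text, vision, and audio tasks."))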
🤝 Interoperability and Community
The library promotes framework interoperability, supporting PyTorch, TensorFlow, and JAX. This flexibility allows users to train a model in one framework and load it for inference in another, or even export models to formats like ONNX and TorchScript for deployment in production environments. The Hugging Face community, accessible through the Hub, forum, or Discord, provides a platform for collaboration and sharing, further enriching the ecosystem.
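As a small sketch of that interoperability (assuming both PyTorch and TensorFlow are installed), the from_tf flag lets a PyTorch model class load TensorFlow weights from the same Hub repository:

from transformers import AutoModel

# Load a checkpoint as a PyTorch model...
pt_model = AutoModel.from_pretrained("bert-base-uncased")

# ...or load the TensorFlow weights from the same repo into PyTorch
pt_model_from_tf = AutoModel.from_pretrained("bert-base-uncased", from_tf=True)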
📊 Datasets Library: Powering Your Projects
In addition to models, Hugging Face offers the 🤗 Datasets library, a powerful tool for accessing and sharing datasets for audio, computer vision, and NLP tasks. It enables loading a dataset in a single line of code and provides efficient data-processing methods to prepare it for training. Backed by the Apache Arrow format, it can process large datasets with optimal speed and efficiency. The integration with the 🤗 Hub makes it easy to load and share datasets with the machine learning community.
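A minimal sketch of that workflow, using rotten_tomatoes purely as an example dataset with a text column:

from datasets import load_dataset

# Download and cache a dataset in one line
dataset = load_dataset("rotten_tomatoes", split="train")

# Arrow-backed processing: derive a new column efficiently
dataset = dataset.map(lambda example: {"n_chars": len(example["text"])})
print(dataset[0])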
🚀 Hugging Face Spaces: Showcase Your ML Prowess
🤗 Spaces provide a convenient platform to host ML demo apps on your profile or organization's profile, enabling you to build your ML portfolio, showcase projects, and collaborate within the ML community. Users can easily create apps using the Streamlit and Gradio SDKs, deploy custom applications with Docker, or develop static Spaces with JavaScript and HTML. Additionally, Spaces can be upgraded to run on GPU or other accelerated hardware for enhanced performance.
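To give a sense of how little code a Space needs, here is a sketch of a complete single-file Gradio app; committing something like this (plus a requirements.txt) to a Space is enough to get a hosted demo:

# app.py
import gradio as gr

# A trivial demo function - swap in your own model call here
def greet(name):
    return f"Bonjour, {name}!"

gr.Interface(fn=greet, inputs="text", outputs="text").launch()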
The video below shows an example of running an app that's trending on 🤗 Spaces at the time of writing. Explore more apps on 🤗 Spaces!
Together, 🤗 Transformers, Datasets, and Spaces form a robust foundation for developing and deploying machine learning models. Whether you are a beginner looking to get started with machine learning or an experienced practitioner seeking to leverage the latest advancements, Hugging Face offers the tools and community support to accelerate your projects and achieve remarkable results. 🚀
Navigating the Treasure Trove of 🤗 Models and Datasets
At the time of writing, there are a staggering 552,137 models available on the 🤗 Models page. With such an overwhelming number of options, where do we even begin?
Well, the first step is to think about the specific task you want to tackle for your application and select the corresponding task from the left side panel. Additionally, you can further refine your search by specifying language support, license type preferences, etc.
In the video below, I demonstrate how to narrow down the model selection: first select the "Text-to-Speech" task, then choose "French" as the desired language, and finally filter for models with the permissive MIT license. Notice how the number of matching models drops to a manageable nine. Conveniently, we can also sort the results by criteria such as trending popularity, number of likes, downloads, creation date, or latest update.
This intuitive filtering system provided by the 🤗 Hub empowers you to quickly identify the most relevant models for your specific needs, saving you valuable time and effort.
Once you have identified a promising model from the search results, clicking on it will take you to its dedicated "Model card" page. This page serves as a comprehensive hub, often containing detailed information about the model: its capabilities, variants, limitations, example code snippets, scripts, and essential notes on ethical usage considerations.
As demonstrated in the video below, you can switch to the "Files and versions" tab to check the size of the pytorch_model.bin file. This file size provides an estimate of the memory requirements for running the model, typically around 1.2 times the file size; a 4 GB checkpoint, for example, would need roughly 4.8 GB of memory. With this information, you can choose an appropriate model or plan your hardware resources accordingly.
Back on the "Model card" page, you will find the "Use in Transformers" section, which offers code snippets to easily load the model checkpoint from the 🤗 Hub. A checkpoint refers to the saved model, including its weights and configurations, encapsulating the trained state of the model.
The provided code snippets offer two options: an approach using the model class directly, or a more user-friendly approach leveraging the Pipeline object. The latter provides a high-level abstraction for various tasks, automatically handling necessary preprocessing and post-processing steps, such as tokenization or encoding of text data, to ensure compatibility with the model's input format.
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-to-speech", model="suno/bark")
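Calling the resulting pipe is then a one-liner. As a rough sketch (the exact output format may vary between library versions, but the text-to-speech pipeline returns the raw waveform together with its sampling rate):

output = pipe("Bonjour, comment allez-vous ?")
print(output["sampling_rate"])  # e.g. 24000 for Bark
print(output["audio"].shape)    # the raw waveform samples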
Similarly, you can filter, find, and use relevant 🤗 Datasets for your tasks.
from datasets import load_dataset
# xlsum is organized into per-language configurations, so we name one
dataset = load_dataset("csebuetnlp/xlsum", "french")
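You can then inspect what was loaded; the column names are dataset-specific (for xlsum they include the article text and its reference summary):

print(dataset)              # available splits and sizes
print(dataset["train"][0])  # first training example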
Let's build something together now!
🚀 Building a Multimodal AI Language App with Hugging Face Transformers
We will chain together two powerful models to build an app that will help us improve our French language skills.
You can run the code in your local environment or on Google Colab.
https://github.com/kmjawadurrahman/huggingface_transformers_translate_tts
🌐 The Translator Component
First, we need to install the transformers and torch Python packages in a virtual environment:
pip install transformers
pip install torch
Here's the code for the translator component:
import torch
from transformers import pipeline

translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16)

english_text = "What is your name?"
translated_french_text = translator(english_text,
                                    src_lang="eng_Latn",
                                    tgt_lang="fra_Latn")
print(translated_french_text[0]["translation_text"])
Let's break down the code:
- We import the necessary libraries: torch for tensor operations and pipeline from the transformers library.
- We load the translation model using pipeline("translation", ...). The task="translation" parameter specifies that we want to use the translation task. The model parameter specifies the pretrained model to use, in this case "facebook/nllb-200-distilled-600M", a multilingual translation model from Facebook (now Meta). The bfloat16 data type strikes a balance between accuracy and computational efficiency.
- We call the translator object with the english_text as input, specifying the source language (src_lang="eng_Latn") and target language (tgt_lang="fra_Latn").
The output we get is the correct French translation "Comment vous appelez-vous ?".
Note: It will take a while to load a model for the first time. The model gets cached, so subsequent loads are faster. The cache lives at ~/.cache/huggingface/hub on Linux and macOS, and at C:\Users\username\.cache\huggingface\hub on Windows. If you are running the code in your local environment, you may want to delete unused models to free up your storage.
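One convenient way to do that is the cache CLI that ships with the huggingface_hub package (installed alongside transformers):

huggingface-cli scan-cache    # list cached models and their sizes
huggingface-cli delete-cache  # interactively pick revisions to delete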
🎙️ Adding a Voice to the Text
We can leverage 🤗 Transformers once again to add text-to-speech. However, this time, instead of using the high-level pipeline helper, we will load the model and processor directly. Here's the code:
import torch
from transformers import BarkModel, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")
model = model.to(device)

inputs = processor("Comment vous appelez-vous ?", voice_preset="fr_speaker_3")
speech_output = model.generate(**inputs.to(device))
print(speech_output)
- We import the necessary modules from the transformers library: BarkModel for the text-to-speech model, and AutoProcessor for processing the input text.
- We check if a GPU is available using torch.cuda.is_available() and set the device accordingly ("cuda:0" for GPU or "cpu" for CPU).
- We load the text-to-speech processor and model using AutoProcessor.from_pretrained("suno/bark") and BarkModel.from_pretrained("suno/bark"), respectively. The "suno/bark" model is a pretrained text-to-speech model from Suno, hosted on the Hugging Face Hub.
- We move the model to the appropriate device using model.to(device).
- We prepare the input text and voice preset using processor("Comment vous appelez-vous ?", voice_preset="fr_speaker_3"). The voice_preset parameter specifies the voice we want to use for the text-to-speech output. In this case, we're using the "fr_speaker_3" preset, which is a French voice.
- We generate the speech output by calling model.generate(**inputs.to(device)), passing the processed input to the model.
- Finally, we print the speech_output, which will contain the audio data for the generated speech.
The speech output is a tensor of raw audio samples, which are numerical values representing the waveform of the audio signal.
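To actually listen to it, one simple option (assuming scipy is installed) is to write the samples to a WAV file, using the sampling rate stored in the model's generation config:

import scipy.io.wavfile as wavfile

sampling_rate = model.generation_config.sample_rate  # Bark's output rate
wavfile.write("bark_output.wav", rate=sampling_rate,
              data=speech_output[0].cpu().numpy())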
🧩 Building the Language App with Gradio
Now that we have the translator and text-to-speech components ready, let's bring them together to build our language learning app using Gradio. First, we need to install two more required packages:
pip install gradio
pip install gradio_client
Here's the code that combines the two components and creates an interactive app with Gradio:
# app.py
import gradio as gr
import torch
from transformers import pipeline, BarkModel, AutoProcessor

# Set the device
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the translation model
translator = pipeline(task="translation",
                      model="facebook/nllb-200-distilled-600M",
                      torch_dtype=torch.bfloat16)

# Load the text-to-speech processor and model
processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark")

# Get the sampling rate for the audio output
sampling_rate = model.generation_config.sample_rate

# Move the model to the appropriate device
model = model.to(device)

# Function to handle the translation and text-to-speech
def launch(input_text):
    # Translate the input text to French
    translated_french_text = translator(input_text,
                                        src_lang="eng_Latn",
                                        tgt_lang="fra_Latn")
    # Prepare the input for the text-to-speech model
    speech_input = processor(translated_french_text[0]["translation_text"],
                             voice_preset="fr_speaker_3")
    # Generate the speech output
    speech_output_tensor = model.generate(**speech_input.to(device))
    # Convert the speech output tensor to a numpy array
    speech_output = speech_output_tensor[0].cpu().numpy()
    # Return the translated text and the audio output
    return translated_french_text[0]["translation_text"], (sampling_rate, speech_output)

# Create the Gradio interface
interface = gr.Interface(launch,
                         inputs="text",
                         outputs=["text", gr.Audio()])

# Launch the app
interface.launch()
- We import the necessary libraries: gradio for creating the interactive app, torch for tensor operations, and pipeline, BarkModel, and AutoProcessor from the transformers library.
- We set the device based on GPU availability.
- We load the translation model using pipeline, and the text-to-speech processor and model using AutoProcessor and BarkModel, respectively.
- We get the sampling rate for the audio output from the text-to-speech model's configuration.
- We move the text-to-speech model to the appropriate device.
- We define a function launch that takes the input text, translates it to French using the translator component, generates speech output using the text-to-speech component, and returns the translated text and audio output.
- We create a Gradio interface using gr.Interface, specifying the launch function as the main function, the input as "text", and the outputs as "text" (for the translated text) and gr.Audio() (for the audio output).
- Finally, we launch the app using interface.launch().
Running this app.py file with python app.py will make the application available in your web browser at http://127.0.0.1:7860/
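If you want to share the demo beyond your machine, Gradio can also generate a temporary public link:

interface.launch(share=True)  # prints a temporary public URL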
Here is the app in action:
Note: The app takes a long time to run because I am using a MacBook. It should run much faster on Google Colab (GPU enabled) or on a Windows machine with a decent Nvidia GPU.
Conclusion
In this introductory blog post, we explored the world of Hugging Face Transformers, covering a primer on the library, navigating its vast collection of models and datasets, and building a multimodal AI language app using translation and text-to-speech models. In the next post, we will learn how to deploy the app using Gradio and the 🤗 Hub, exposing it as an API for seamless integration.