We are pleased to announce that the first releases of hfhub and tok are now live on CRAN. hfhub is an R interface to the Hugging Face Hub, allowing users to download and cache files from Hugging Face Hub, while tok implements R bindings for the Hugging Face tokenizers library.
Hugging Face rapidly became the platform for building, sharing, and collaborating on deep learning applications, and we hope these integrations will help R users get started with Hugging Face tools as well as build novel applications.
We also previously released the safetensors package, which can read and write files in the safetensors format.
hfhub
hfhub is an R interface to the Hugging Face Hub. hfhub currently implements a single piece of functionality: downloading files from Hub repositories. Model Hub repositories are mainly used to store pre-trained model weights together with any other metadata necessary to load the model, such as the hyperparameter configurations and the tokenizer vocabulary.
Downloaded files are cached using the same layout as the Python library, so cached files can be shared between the R and Python implementations, making it easier and quicker to switch between languages.
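Because the layouts match, pointing both languages at the same cache directory is all it takes to share downloads. Here is a minimal sketch, under the assumption that hfhub honors the HUGGINGFACE_HUB_CACHE environment variable the same way the Python library does:

# Hypothetical shared cache location, used by both the R and Python libraries.
Sys.setenv(HUGGINGFACE_HUB_CACHE = "~/.cache/huggingface/hub")
path <- hfhub::hub_download("gpt2", "config.json")
# Calling hub_download() again for the same file returns the cached path
# without re-downloading.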
We already use hfhub in the minhub package and in the ‘GPT-2 from scratch with torch’ blog post to download pre-trained weights from Hugging Face Hub.
You can use hub_download() to download any file from a Hugging Face Hub repository by specifying the repository ID and the path to the file you want to download. If the file is already in the cache, the function returns the file path immediately; otherwise, the file is downloaded, cached, and then the access path is returned.
path <- hfhub::hub_download("gpt2", "model.safetensors")
path
#> /Users/dfalbel/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/model.safetensors
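From there, the weights can be read back into R, for example with the safetensors package mentioned above. A minimal sketch, assuming safe_load_file() with its defaults, which returns the checkpoint as a named list of tensors:

# Read the downloaded weights; each list entry is one tensor from the checkpoint.
weights <- safetensors::safe_load_file(path)
head(names(weights))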
tok
Tokenizers are responsible for converting raw text into the sequences of integers that are often used as input to NLP models, making them a critical component of NLP pipelines. If you want a higher-level overview of NLP pipelines, you might want to read our previous blog post ‘What are Large Language Models? What are they not?’.
When using a pre-trained model (whether for inference or for fine-tuning), it is very important to use the exact same tokenization process that was used during training, and the Hugging Face team has done an amazing job of making sure its algorithms match the tokenization strategies used by most LLMs.
tok provides R bindings to the 🤗 tokenizers library. The tokenizers library itself is implemented in Rust for performance, and our bindings use the extendr project to help interface with R. Using tok, we can tokenize text the exact same way most NLP models do, making it easier to load pre-trained models in R as well as to share our models with the broader NLP community.
tok can be installed from CRAN, and its usage is currently limited to loading tokenizer vocabularies from files. For example, you can load the tokenizer for the GPT-2 model with:
tokenizer <- tok::tokenizer$from_pretrained("gpt2")
ids <- tokenizer$encode("Hello world! You can use tokenizers from R")$ids
ids
#> [1] 15496   995     0   921   460   779 11241 11341   422   371
tokenizer$decode(ids)
#> [1] "Hello world! You can use tokenizers from R"
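It can also be instructive to look at the string pieces behind those ids. A minimal sketch, assuming the encoding object exposes a $tokens field mirroring the upstream tokenizers library:

# Inspect how the text was split into subword pieces
# (Ġ marks a piece that begins with a space).
tokenizer$encode("Hello world!")$tokens
#> [1] "Hello"  "Ġworld" "!"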
Spaces
Remember that you can already host Shiny apps (for both R and Python) on Hugging Face Spaces. As an example, we built a Shiny app that uses:
- torch to implement GPT-NeoX (the neural network architecture of StableLM, the model used for chatting)
- hfhub to download and cache the pre-trained weights from the StableLM repository
- tok to tokenize and pre-process text as input for the torch model; tok also uses hfhub to download the tokenizer’s vocabulary (see the sketch after this list)
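A minimal sketch of how these packages could fit together in such an app; the repository id, the weights file name, and everything past tokenization are illustrative assumptions, not the app’s actual code:

repo <- "stabilityai/stablelm-tuned-alpha-3b"  # hypothetical repository id

# tok fetches the tokenizer vocabulary (via hfhub) and tokenizes the prompt.
tokenizer <- tok::tokenizer$from_pretrained(repo)
ids <- tokenizer$encode("Hello!")$ids

# hfhub downloads and caches the pre-trained weights.
weights_path <- hfhub::hub_download(repo, "model.safetensors")  # hypothetical file name

# A torch implementation of GPT-NeoX would then consume both.
input <- torch::torch_tensor(ids)$unsqueeze(1)  # add a batch dimension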
The app is hosted in this Space. It currently runs on CPU, but you can easily switch the Docker image if you want to run it on a GPU for faster inference.
The app’s source code is also open source and can be found in the Space’s Files tab.
Looking forward
These are the very early days of hfhub and tok, and there is still a lot of work to do and functionality to implement. We hope the community can help us prioritize that work, so if there is a feature you are missing, please open an issue in the GitHub repositories.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
To give attribution, please cite this work as follows:
Falbel (2023, July 12). Posit AI Blog: Hugging Face Integrations. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/
BibTeX citation
@misc{hugging-face-integrations,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: Hugging Face Integrations},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/},
  year = {2023}
}