We are pleased to announce that the first releases of hfhub and tok are now live on CRAN. hfhub is an R interface to the Hugging Face Hub, allowing users to download and cache files from Hugging Face Hub, while tok implements R bindings for the Hugging Face tokenizers library.
Hugging Face rapidly became the platform for building, sharing, and collaborating on deep learning applications, and we hope these integrations will help R users get started with Hugging Face tools as well as build novel applications.
We also previously released the safetensors package, which can read and write files in the safetensors format.
hfhub
hfhub is an R interface to the Hugging Face Hub. hfhub currently implements a single piece of functionality: downloading files from Hub repositories. Model Hub repositories are mainly used to store pre-trained model weights together with any other metadata necessary to load the model, such as the hyperparameter configurations and the tokenizer vocabulary.
Downloaded files are cached using the same layout as the Python library, so cached files can be shared between the R and Python implementations, making it easier and quicker to switch between languages.
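Because the layouts match, pointing both languages at the same cache directory is all it takes to share downloads. Here is a minimal sketch, under the assumption that hfhub honors the HUGGINGFACE_HUB_CACHE environment variable the same way the Python library does:

# Hypothetical shared cache location, used by both the R and Python libraries.
Sys.setenv(HUGGINGFACE_HUB_CACHE = "~/.cache/huggingface/hub")
path <- hfhub::hub_download("gpt2", "config.json")
# Calling hub_download() again for the same file returns the cached path
# without re-downloading.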
We already use hfhub in the minhub package and in the ‘GPT-2 from scratch with torch’ blog post to download pre-trained weights from Hugging Face Hub.
You can use hub_download() to download any file from a Hugging Face Hub repository by specifying the repository ID and the path to the file you want to download. If the file is already in the cache, the function returns the file path immediately; otherwise, the file is downloaded, cached, and then the access path is returned.
path <- hfhub::hub_download("gpt2", "model.safetensors")
path
#> /Users/dfalbel/.cache/huggingface/hub/models--gpt2/snapshots/11c5a3d5811f50298f278a704980280950aedb10/model.safetensors
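From there, the weights can be read back into R, for example with the safetensors package mentioned above. A minimal sketch, assuming safe_load_file() with its defaults, which returns the checkpoint as a named list of tensors:

# Read the downloaded weights; each list entry is one tensor from the checkpoint.
weights <- safetensors::safe_load_file(path)
head(names(weights))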
tok
Tokenizers are responsible for converting raw text into the sequences of integers that are often used as input to NLP models, making them a critical component of NLP pipelines. If you want a higher-level overview of NLP pipelines, you might want to read our previous blog post ‘What are Large Language Models? What are they not?’.
When using a pre-trained model (whether for inference or for fine-tuning), it is very important to use the exact same tokenization process that was used during training, and the Hugging Face team has done an amazing job of making sure its algorithms match the tokenization strategies used by most LLMs.
tok provides R bindings to the 🤗 tokenizers library. The tokenizers library itself is implemented in Rust for performance, and our bindings use the extendr project to help interface with R. Using tok, we can tokenize text the exact same way most NLP models do, making it easier to load pre-trained models in R as well as to share our models with the broader NLP community.
tok can be installed from CRAN, and its usage is currently limited to loading tokenizer vocabularies from files. For example, you can load the tokenizer for the GPT-2 model with:
tokenizer <- tok::tokenizer$from_pretrained("gpt2")
ids <- tokenizer$encode("Hello world! You can use tokenizers from R")$ids
ids
#> [1] 15496   995     0   921   460   779 11241 11341   422   371
tokenizer$decode(ids)
#> [1] "Hello world! You can use tokenizers from R"
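It can also be instructive to look at the string pieces behind those ids. A minimal sketch, assuming the encoding object exposes a $tokens field mirroring the upstream tokenizers library:

# Inspect how the text was split into subword pieces
# (Ġ marks a piece that begins with a space).
tokenizer$encode("Hello world!")$tokens
#> [1] "Hello"  "Ġworld" "!"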
Spaces
Remember that you can already host Shiny apps (for both R and Python) on Hugging Face Spaces. As an example, we built a Shiny app that uses:
- torch to implement GPT-NeoX (the neural network architecture of StableLM, the model used for chatting)
- hfhub to download and cache the pre-trained weights from the StableLM repository
- tok to tokenize and pre-process text as input for the torch model; tok also uses hfhub to download the tokenizer’s vocabulary (see the sketch after this list)
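A minimal sketch of how these packages could fit together in such an app; the repository id, the weights file name, and everything past tokenization are illustrative assumptions, not the app’s actual code:

repo <- "stabilityai/stablelm-tuned-alpha-3b"  # hypothetical repository id

# tok fetches the tokenizer vocabulary (via hfhub) and tokenizes the prompt.
tokenizer <- tok::tokenizer$from_pretrained(repo)
ids <- tokenizer$encode("Hello!")$ids

# hfhub downloads and caches the pre-trained weights.
weights_path <- hfhub::hub_download(repo, "model.safetensors")  # hypothetical file name

# A torch implementation of GPT-NeoX would then consume both.
input <- torch::torch_tensor(ids)$unsqueeze(1)  # add a batch dimension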
The app is hosted in this Space. It currently runs on CPU, but you can easily switch the Docker image if you want to run it on a GPU for faster inference.
The app’s source code is also open source and can be found in the Space’s Files tab.
Looking forward
These are the very early days of hfhub and tok, and there is still a lot of work to do and functionality to implement. We hope the community can help us prioritize that work, so if there is a feature you are missing, please open an issue in the GitHub repositories.
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
To give attribution, please cite this work as follows:
Falbel (2023, July 12). Posit AI Blog: Hugging Face Integrations. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/
BibTeX citation
@misc{hugging-face-integrations,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: Hugging Face Integrations},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-07-12-hugging-face-integrations/},
  year = {2023}
}