safetensors is a new, simple, fast, and secure file format for storing tensors. The design and original implementation of the file format were led by Hugging Face, and it has been widely adopted in the popular ‘transformers’ framework. The safetensors R package is a pure R implementation that can read and write safetensors files.
The initial version (0.1.0) of safetensors is now on CRAN.
Motivation
The main motivation for safetensors in the Python community is security. As stated in the official documentation:
The main rationale for this crate is to remove the need to use pickle in PyTorch, which is used by default.
Pickle is considered an unsafe format because loading a Pickle file may lead to arbitrary code execution. For R users, this was never an issue with torch, because the Pickle parser included in LibTorch only supports a subset of Pickle types that cannot contain executable code.
However, the file format has additional advantages over other commonly used formats:
- Lazy loading support: you can choose to read only a subset of the tensors stored in a file.
- Zero copy: reading a file requires no more memory than the file itself. (Technically, the current R implementation makes a single copy, but this could be optimized away if really needed at some point.)
- Simple: the file format is simple to implement and does not require complex dependencies, which makes it a suitable format for exchanging tensors between ML frameworks and between different programming languages. For example, you can write a safetensors file in R and load it in Python, and vice versa.
You can see a full comparison with other commonly used file formats in this space here.
Format
The safetensors format is illustrated in the figure below. It's basically a header containing some metadata, followed by the raw tensor buffers.
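Concretely, the file starts with an 8-byte little-endian integer giving the size of a JSON header, followed by the JSON header itself and then the raw buffers. As a rough illustration only (safe_load_file() does all of this for you), the header can be read by hand with base R and jsonlite:

library(torch)
library(safetensors)
library(jsonlite)

# Create a small safetensors file so we have something to inspect.
path <- tempfile()
safe_save_file(list(x = torch_randn(2, 2)), path)

con <- file(path, "rb")
# The first 8 bytes encode the size of the JSON header (little-endian).
size_bytes <- readBin(con, what = "raw", n = 8)
header_size <- sum(as.numeric(size_bytes) * 256^(seq_along(size_bytes) - 1))
# The next `header_size` bytes are the JSON header describing each tensor
# (dtype, shape, and byte offsets); the raw tensor buffers follow.
header_json <- rawToChar(readBin(con, what = "raw", n = header_size))
close(con)
str(fromJSON(header_json))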
Basic usage
safetensors can be installed from CRAN using:
install.packages("safetensors")
You can then build a named list of torch tensors.
library(torch)
library(safetensors)
tensors <- list(
x = torch_randn(10, 10),
y = torch_ones(10, 10)
)
str(tensors)
#> List of 2
#> $ x:Float (1:10, 1:10)
#> $ y:Float (1:10, 1:10)
tmp <- tempfile()
safe_save_file(tensors, tmp)
You can pass additional metadata to the saved file by providing a metadata parameter containing a named list.
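A minimal sketch, assuming the metadata values are character strings (the key names below are arbitrary examples):

safe_save_file(
  tensors,
  tmp,
  # 'framework' and 'note' are just illustrative keys.
  metadata = list(framework = "torch", note = "toy example")
)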
Reading safetensors files is handled by safe_load_file(), which returns a named list of tensors along with a metadata attribute containing the parsed file header.
tensors <- safe_load_file(tmp)
str(tensors)
#> List of 2
#> $ x:Float (1:10, 1:10)
#> $ y:Float (1:10, 1:10)
#> - attr(*, "metadata")=List of 2
#> ..$ x:List of 3
#> .. ..$ shape : int (1:2) 10 10
#> .. ..$ dtype : chr "F32"
#> .. ..$ data_offsets: int (1:2) 0 400
#> ..$ y:List of 3
#> .. ..$ shape : int (1:2) 10 10
#> .. ..$ dtype : chr "F32"
#> .. ..$ data_offsets: int (1:2) 400 800
#> - attr(*, "max_offset")= int 929
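As the str() output shows, the parsed header travels along as the metadata attribute, so individual entries can be inspected directly:

# Look up the dtype recorded for tensor 'x' in the parsed header.
attr(tensors, "metadata")$x$dtype
#> [1] "F32"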
Currently safetensors only supports writing torch tensors, but we plan to add support for writing regular R arrays and tensorflow tensors in the future.
Future directions
In the next version of torch, safetensors will be used as the serialization format. That is, when calling torch_save() on models, lists of tensors, or other object types supported by torch_save, you will get a valid safetensors file.
This is an improvement over the previous implementation for the following reasons:
- It's much faster: around 10x faster for medium-sized models, and potentially more for large files. It also improves the performance of parallel dataloaders by up to 30%.
- It improves cross-language and cross-framework compatibility: you can train a model in R and use it from Python (and vice versa), or train a model in tensorflow and run it with torch.
If you want to try it out, you can install the torch development version using:
remotes::install_github("mlverse/torch")
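As a hedged sketch of what this could look like, assuming the development version behaves as described above:

library(torch)

# With the safetensors-based serialization, saving a list of tensors is
# expected to produce a valid safetensors file.
path <- tempfile(fileext = ".safetensors")
torch_save(list(x = torch_randn(10, 10)), path)

# Assumption based on the compatibility claims above: the same file should be
# readable with safetensors::safe_load_file() or from other frameworks.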
Photo by Nick Fewings on Unsplash
Reuse
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures reused from other sources do not fall under this license and can be recognized by a note in their caption: “Figure from …”.
Citation
For attribution, please cite this work as:
Falbel (2023, June 15). Posit AI Blog: safetensors 0.1.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/
BibTeX citation
@misc{safetensors,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: safetensors 0.1.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/},
  year = {2023}
}