The beginning
While working on a Databricks with R workshop a few months ago, I came across some of Databricks’ custom SQL functions. These functions are prefixed with “ai_”, and they run NLP tasks with a simple SQL call:
> SELECT ai_analyze_sentiment('I am happy');
positive
> SELECT ai_analyze_sentiment('I am sad');
negative
This was a revelation to me, and it opened up new ways to use LLMs in my daily work as an analyst. So far, I had mainly used LLMs for code completion and development tasks. This new approach instead applies the LLM directly to the data.
My first reaction was to try to access these custom functions through R.
dbplyr
With dbplyr we can access SQL functions from R, and it was great to see this one work:
orders |>
  mutate(
    sentiment = ai_analyze_sentiment(o_comment)
  )
#> # Source:   SQL [6 x 2]
#> o_comment sentiment
#> <chr> <chr>
#> 1 ", pending theodolites … neutral
#> 2 "uriously special foxes … neutral
#> 3 "sleep. courts after the … neutral
#> 4 "ess foxes may sleep … neutral
#> 5 "ts wake blithely unusual … mixed
#> 6 "hins sleep. fluffily … neutral
One downside of this integration is that, even though it is accessible from R, using an LLM this way requires a live connection to Databricks, which limits the number of people who can benefit from it.
According to the documentation, Databricks uses the Llama 3.1 70B model to power these functions. It is a very capable large language model, but its sheer size is a serious problem for most users’ machines: at 16-bit precision, 70 billion parameters alone occupy roughly 140 GB of memory, making the model impractical to run on standard hardware.
Local LLMs reach viability
LLM development has been moving at a rapid pace. Initially, only hosted (online) large language models were practical to use, which raised concerns among companies hesitant to share their data with an external service. Moreover, the cost of a hosted LLM can be significant; per-token fees add up quickly.
The ideal solution is to integrate the LLM into your own systems, which requires three things:
- A model that fits comfortably in memory
- A model that achieves sufficient accuracy for NLP tasks
- An intuitive interface between the model and the user's notebook or IDE
A year ago, having all three of these was nearly impossible: models that fit in memory were either inaccurate or extremely slow. However, recent advances, such as Meta’s Llama models and cross-platform inference engines like Ollama, now make it feasible to deploy these models locally, offering a promising solution for enterprises looking to integrate LLMs into their workflows.
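To make this concrete, here is a minimal sketch of calling a local Llama 3.2 model from R through Ollama, using the ollamar package (this assumes Ollama is installed and its server is running locally):

library(ollamar)

# Download the model once; afterwards it is served from the local machine
pull("llama3.2")

# Ask the local model a question and get the reply back as plain text
generate(
  model  = "llama3.2",
  prompt = "Classify the sentiment of this text: I am happy",
  output = "text"
)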
The project
This project began as an exploration, driven by my interest in using a general-purpose LLM to produce results comparable to the Databricks AI functions. The main challenge was determining how much setup and preparation these models would need in order to deliver reliable and consistent results.
Without access to design documents or source code for the Databricks functions, the LLM’s output was the only basis for testing. This presented several obstacles, starting with the sheer number of options available for tuning the model; even within prompt engineering alone, the possibilities are vast. A delicate balance had to be struck between accuracy and generality, so that the model was neither too specialized nor too focused on a particular topic or outcome.
Fortunately, after extensive testing, I found that simple “one-shot” prompts yielded the best results. “Best” means that the answers were both accurate for a given row and consistent across rows. Consistency meant the model returned one of the specified options (positive, negative, or neutral) with no further explanation.
Here’s an example prompt that works reliably for Llama 3.2:
>>> You are a helpful sentiment engine. Return only one of the
... following answers: positive, negative, neutral. No capitalization.
... No explanations. The answer is based on the following text:
... I am happy
positive
As an aside, attempts to submit several rows in a single call did not pan out. I spent quite a bit of time exploring different approaches, including submitting 2 or 10 rows at once, formatted as JSON or as CSV. The results were often inconsistent, and batching did not seem to speed things up enough to be worth the effort.
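Putting those two findings together, one one-shot prompt per row, the core loop looks roughly like the sketch below. The build_prompt() helper is hypothetical and simply wraps a row’s text in the prompt shown above:

library(ollamar)

# Hypothetical helper: wrap one row's text in the one-shot prompt
build_prompt <- function(text) {
  paste(
    "You are a helpful sentiment engine. Return only one of the",
    "following answers: positive, negative, neutral. No capitalization.",
    "No explanations. The answer is based on the following text:",
    text
  )
}

texts <- c("I am happy", "I am sad")

# One request per row; each reply should be a single word
sentiments <- vapply(
  texts,
  \(x) generate("llama3.2", build_prompt(x), output = "text"),
  character(1)
)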
Once I was comfortable with the approach, the next step was to wrap the functionality within an R package.
The approach
One of my goals was to make the mall package as “ergonomic” as possible. In other words, I wanted its use in both R and Python to fit seamlessly with how data analysts work in their favorite language every day.
For R this was relatively simple: I just had to make sure the functions worked well with pipes (%>% and |>) and integrated easily with tidyverse packages:
reviews |>
  llm_sentiment(review) |>
  filter(.sentiment == "positive") |>
  select(review)
#> review
#> 1 This has been the best TV I've ever used. Great screen, and sound.
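For reference, mall needs to be told which back end and model to drive. At the time of writing, this is done with llm_use(); here is a sketch of a typical setup (the exact arguments may evolve with the package):

library(mall)

# Use a local Ollama model; setting a seed makes runs reproducible
llm_use("ollama", "llama3.2", seed = 100)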
Python, however, is not my first language, so I had to adjust my mental model of data manipulation. Specifically, I learned that in Python, objects (such as pandas DataFrames) “contain” their transformation functions by design.
This insight led me to investigate whether the pandas API allows extensions, and fortunately, it does! After exploring the possibilities, I decided to start with Polars, which allows its API to be extended by registering new namespaces. This simple addition gave users easy access to the functionality they needed:
>>> import polars as pl
>>> import mall
>>> df = pl.DataFrame(dict(x = ("I am happy", "I am sad")))
>>> df.llm.sentiment("x")
shape: (2, 2)
┌────────────┬───────────┐
│ x ┆ sentiment │
│ --- ┆ --- │
│ str ┆ str │
╞════════════╪═══════════╡
│ I am happy ┆ positive │
│ I am sad ┆ negative │
└────────────┴───────────┘
Keeping all of the new functionality within the llm namespace makes it very easy for users to find and use what they need.
What’s next?
It will be easier to see where mall should go once the community starts using it and providing feedback. I anticipate that adding more LLM back ends will be the main request. Another likely enhancement is updating the prompts as newer models become available; for example, when moving from Llama 3.1 to Llama 3.2, one of the prompts needed adjusting. The package is structured so that future changes add new prompts rather than replace existing ones, in order to maintain backward compatibility.
This is the first time I have written an article about the history and structure of a project. This particular effort was unique because of its combination of R, Python, and LLMs, so I thought it was worth sharing.
If you want to learn more about mall, visit its official site: https://mlverse.github.io/mall/.