A recent article in Computerworld claimed that the output of generative AI systems such as GPT and Gemini is not as good as it used to be. This is not the first time I have heard this complaint, though I don't know how widespread the opinion is. But I am curious: Is it true? And if so, why?
I think a couple of things are going on in the AI world. First, AI developers are trying to improve the output of their systems. They are (I'd guess) more focused on satisfying enterprise customers who can execute large contracts than on serving individuals who pay $20 a month. If I were doing that, I would tune my models to produce more formal business prose. (Not that formal business prose is good prose, but it is what it is.) We can say "don't just paste AI output into your reports" as often as we want, but that doesn't mean people won't do it, and AI developers will try to give them what they want.
Second, AI developers are certainly trying to make models more accurate. Error rates are noticeably lower, but they are far from zero. Tuning a model toward a lower error rate, though, may mean limiting its ability to come up with the unusual answers we find brilliant, insightful, or surprising. That's the trade-off: reducing the standard deviation cuts off both tails. The price we pay for minimizing hallucinations and other errors is also minimizing the accurate, "good" outliers. I'm not arguing that developers shouldn't minimize hallucinations, only that there is a price to pay.
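To make the "cutting off the tails" point concrete, here is a toy sketch, not a claim about how any real model is tuned: if answer quality were a bell curve, shrinking its spread would remove the terrible answers and the brilliant ones together. The function name `tail_fractions` and the threshold are my own illustrative choices.

```python
# Toy illustration only: shrinking the spread of a distribution removes
# both tails -- the "hallucination" outliers and the "brilliant" ones.
import numpy as np

rng = np.random.default_rng(42)

def tail_fractions(std_dev, n=1_000_000, threshold=2.0):
    """Fraction of samples beyond +/- threshold for a zero-mean normal."""
    samples = rng.normal(loc=0.0, scale=std_dev, size=n)
    bad = np.mean(samples < -threshold)        # error-like outliers
    brilliant = np.mean(samples > threshold)   # surprising, insightful outliers
    return bad, brilliant

for sd in (1.0, 0.5):
    bad, brilliant = tail_fractions(sd)
    print(f"std={sd}: bad outliers {bad:.3%}, brilliant outliers {brilliant:.3%}")
```

Halving the spread drives both tail fractions toward zero at the same time; you can't trim one side of the curve without trimming the other.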
The "AI blues" has also been attributed to model collapse. I think model collapse is a real phenomenon; I've even done some very unscientific experiments of my own. But it's too early to see it in the large language models we use. They aren't retrained often enough, and the amount of AI-generated content in their training data is still relatively small, especially if their creators are violating copyright on a large scale.
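For readers who haven't seen the idea in action, here is the kind of very unscientific experiment I mean: a toy loop that fits a Gaussian to data, samples "synthetic" data from the fit, refits, and repeats. It is an illustration of the collapse intuition, not a simulation of any real LLM training pipeline, and the generation counts are arbitrary.

```python
# Deliberately unscientific toy of "model collapse": train on your own
# previous outputs long enough and the estimated diversity quietly shrinks.
import numpy as np

rng = np.random.default_rng(0)

def collapse_demo(generations=500, samples_per_gen=100):
    mean, std = 0.0, 1.0                                # the original "real" distribution
    for gen in range(1, generations + 1):
        data = rng.normal(mean, std, samples_per_gen)   # generation trained on current model's output
        mean, std = data.mean(), data.std()             # the new "model" is just the refit
        if gen % 100 == 0:
            print(f"generation {gen:4d}: estimated std = {std:.4f}")

collapse_demo()
```

Each refit is a noisy estimate of the previous one, and the noise compounds: over many generations the estimated spread drifts downward and the "model" forgets the variety that was in the original data.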
But there is another possibility that is very human and has nothing to do with the language models themselves. ChatGPT has been around for almost two years. When it was released, we were all amazed at how good it was. One or two people pointed to Samuel Johnson's prophetic words from the 18th century: "Sir, ChatGPT's output is like a dog's walking on his hind legs. It is not done well; but you are surprised to find it done at all."1 Well, we were all surprised, errors, hallucinations, and all. We were amazed to find that a computer could actually hold a conversation, and do it reasonably fluently, even those of us who had tried GPT-2.
But now, almost two years later, we've gotten used to ChatGPT and its peers: Gemini, Claude, Llama, Mistral, and a bunch more. We've started using generative AI for real work, and the amazement has worn off. We're less tolerant of its obsessive verbosity (which may have increased). We don't find it insightful or original (though we don't know whether it ever was). It's possible that the quality of language model output has gotten worse over the past two years, but I think the real change is that we've become less tolerant.
I’m sure many people have tested this much more rigorously than I have, but I’ve run two tests on most language models since their early days.
- Writing a Petrarchan sonnet. (Petrarch’s sonnets have a different rhyme scheme than Shakespeare’s sonnets.)
- Implementing a well-known but non-trivial algorithm correctly in Python (I usually use the Miller–Rabin primality test).
The results of the two tests are surprisingly similar. Until a few months ago, the major LLMs could not write a Petrarchan sonnet. They could describe a Petrarchan sonnet correctly, but when asked to write one, they would botch the rhyme scheme and usually produce a Shakespearean sonnet instead. They failed even when the Petrarchan rhyme scheme was included in the prompt. They failed even when they tried it in Italian (an experiment run by one of my colleagues). Then, suddenly, around the time of Claude 3, the models learned to do Petrarch correctly. They got better. A while ago, I decided to try two more difficult poetic forms, the sestina and the villanelle. (A villanelle involves cleverly repeating two lines in addition to following a rhyme scheme. A sestina requires reusing the same end words throughout.) They could do it! They were no match for a Provençal troubadour, but they could do it!
I got the same result when I asked GPT-3 to write a program implementing the Miller–Rabin algorithm to test whether a large number is prime. When GPT-3 first came out, it was a total failure: it produced code that ran without errors, but it told me that numbers like 21 were prime. Gemini was the same, though after several attempts it ungraciously blamed Python's libraries for problems computing with large numbers. (I gather it doesn't like users who say, "Sorry, that's wrong again. What are you doing that's incorrect?") Now, at least the last time I tried, it implements the algorithm correctly. (Your mileage may vary.)
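For reference, this is the kind of implementation the test asks for: a minimal Miller–Rabin sketch of my own, not the code any particular model produced. The function name and the sanity checks at the bottom are illustrative choices; 21 is exactly the sort of composite the early models got wrong.

```python
# Minimal Miller-Rabin sketch (random witnesses, so probabilistic).
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    small_primes = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in small_primes:
        if n % p == 0:
            return n == p
    # Write n - 1 as 2^r * d with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True

print(is_probable_prime(21))         # False: 21 = 3 * 7
print(is_probable_prime(2**89 - 1))  # True: a Mersenne prime
```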
My success doesn't mean there's no room for frustration. I've asked ChatGPT how to improve programs that worked correctly but had known problems. In some cases I knew both the problem and the solution; in others I understood the problem but not how to fix it. The first time you try this, you'll probably be impressed: "put more of your program into functions and use more descriptive variable names" might not be what you're looking for, but it's never bad advice. By the second or third time, though, you realize that you're always getting similar advice, and while few people would object to it, it isn't really insightful. "Surprised to find it done at all" decays quickly into "it isn't done well."
This experience probably reflects a fundamental limitation of language models. After all, they aren't "intelligent" per se. Until we know otherwise, they are just predicting what should come next based on their analysis of training data. How much of the code on GitHub or Stack Overflow actually demonstrates good coding practices? How much of it is rather pedestrian, like my own code? I suspect the latter group dominates, and that is what's reflected in an LLM's output. Thinking back to Johnson's dog, I am surprised that it is done at all, though perhaps not for the reason most people would expect. There is certainly a lot of material on the internet that is not wrong. But there is also a lot that is not as good as it could be, and that should surprise no one. What's unfortunate is that the volume of "pretty good, but not as good as it could be" content tends to dominate a language model's output.
This is the big question facing language model developers: How do we get answers that are insightful, entertaining, and better than the average answer found on the internet? The initial surprise has faded, and AI is being judged on its merits. Will AI continue to deliver on its promise, or will we just say, "That's dull, boring AI," even as its output permeates every aspect of our lives? There may be some truth to the idea that we're trading delightful answers for trustworthy answers, and that's not a bad thing. But we need delight and insight too. How will AI deliver that?
Footnote
1. Boswell, Life of Johnson (1791); perhaps slightly modified.