Deedy Das, an AI investor at Menlo Ventures, says the new tokenizer has a total of 200,000 tokens, with about 25% of them in languages other than English. He used a language filter to count the tokens in each language; after English, the most represented languages were Russian, Arabic, and Vietnamese.
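A language filter of this kind can be as simple as grouping each vocabulary entry by the Unicode script of its characters. The sketch below is a minimal, stdlib-only illustration of the idea; the sample vocabulary is a handful of made-up entries, not actual GPT-4o tokens, and a real audit would iterate over all ~200,000 strings in the tokenizer's vocabulary.

```python
import unicodedata
from collections import Counter

def dominant_script(token: str) -> str:
    """Classify a token by the Unicode script of its letters (rough heuristic)."""
    scripts = Counter()
    for ch in token:
        if ch.isalpha():
            # unicodedata.name() starts with the script, e.g. "CYRILLIC SMALL LETTER A"
            name = unicodedata.name(ch, "UNKNOWN")
            scripts[name.split()[0]] += 1
    if not scripts:
        return "OTHER"
    return scripts.most_common(1)[0][0]

# Hypothetical sample of vocabulary entries (placeholders, not real tokens).
sample_vocab = [" the", " привет", " مرحبا", " không", " नमस्ते", "ing", " данные"]

counts = Counter(dominant_script(tok) for tok in sample_vocab)
print(counts.most_common())
```

Note that script is only a proxy for language: Vietnamese, for example, is written in Latin script, so distinguishing it from English would require an actual language-identification model rather than this character-level heuristic.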
“So I think the main effect of the tokenizer is not to dramatically improve the quality in these languages, but to reduce costs in these languages,” says Das. If LLMs have better, longer tokens in languages other than English, they can process prompts faster and charge users less for the same answers. With the new tokenizer, “the cost savings can be almost four times,” he says.
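The cost mechanism is straightforward: API usage is billed per token, so if the new tokenizer encodes the same non-English text in roughly a quarter of the tokens, the bill shrinks by roughly the same factor. The numbers below are purely illustrative, not actual OpenAI prices or measured token counts.

```python
# Illustrative arithmetic only: hypothetical price and token counts,
# not real OpenAI pricing or a real measurement.
price_per_token = 0.00001          # hypothetical $/token
old_tokens, new_tokens = 40, 11    # same sentence under old vs. new tokenizer

old_cost = old_tokens * price_per_token
new_cost = new_tokens * price_per_token
print(f"cost ratio: {old_cost / new_cost:.1f}x")  # roughly the "almost 4x" Das cites
```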
Das, who also speaks Hindi and Bengali, looked at the longest tokens in those languages. The tokens reflect the discourse happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English loan terms like “Prime Minister,” “university,” and “international” also appear frequently. They also do not show the problems that surround the Chinese tokens.
This probably reflects the training data available in those languages, Das says. “My working theory is that websites in Hindi and Bengali are very rudimentary. It’s (mostly) news articles. So I would expect this to be the case. There aren’t many spam bots and porn websites trying it in these languages. Most of that happens in English.”
Contaminated data and too little cleaning
In Chinese, however, the situation is significantly different. According to several researchers who examined the new token library used in GPT-4o, the longest tokens in Chinese are almost entirely spam words used in pornography, gambling, and scam contexts. Even shorter tokens, such as three-character Chinese words, are heavily skewed toward those topics.
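An audit of the kind the researchers describe can be sketched as a scan over the vocabulary for long, mostly-CJK tokens that contain spam-associated terms. Everything below is an illustrative placeholder: the sample vocabulary and the keyword list are invented for the sketch, and a real audit would load GPT-4o's actual vocabulary (for example via the `tiktoken` library) and use a much larger term list.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

# Hypothetical spam-associated substrings (gambling, adult content, scams).
SPAM_TERMS = ["赌场", "彩票", "色情", "发票"]

def flag_spam_tokens(vocab, min_len=3):
    """Return tokens with at least min_len CJK characters that contain a spam term."""
    flagged = []
    for tok in vocab:
        cjk_chars = [c for c in tok if is_cjk(c)]
        if len(cjk_chars) >= min_len and any(term in tok for term in SPAM_TERMS):
            flagged.append(tok)
    return flagged

# Placeholder vocabulary sample, not actual GPT-4o tokens.
sample_vocab = ["你好", "在线赌场", "免费色情视频", "人工智能", "大学"]
print(flag_spam_tokens(sample_vocab))  # flags the gambling and porn entries
```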
“The problem is clear: the corpus used to train (the tokenizer) is not clean. The English tokens look good, the Chinese tokens don’t,” says Cai of Princeton University. It is not unusual for a language model to pick up spam when crawling training data, but that data normally requires significant cleaning effort before it is used. “For the Chinese data, it’s possible they didn’t clean it properly,” he says.
The content of these Chinese tokens suggests they were contaminated by a specific phenomenon: websites hijacking unrelated content, in Chinese or other languages, to boost their spam messages.
These messages are often advertisements for pornographic videos and gambling websites, which may be real businesses or simply scams. Such language gets embedded in content-farm websites, and sometimes legitimate ones, so that it is indexed by search engines, circumvents spam filters, and surfaces in random searches. For example, Google indexed a search results page on the National Institutes of Health website that listed a Chinese-language pornographic site. The same site name also appeared in at least five Chinese-language tokens in GPT-4o.