“Languages are equal, but of course, not really from an economical point of view”

In the emerging world of AI, the machines, and the numbers that describe them, are big. Really big. Terabytes of data go into a model. Weeks are needed to train it. Millions of dollars spent. Trillions of parameters.

Followers of the Overtone story are likely familiar with OpenAI’s GPT-3, Wu Dao, or other efforts that have shown impressive performance in producing human-like language. But despite all this expansiveness, there are still limitations facing “large language models.” One of them is language itself.

Last week I had the chance to chat with Dr. Jochen Leidner, a professor at Coburg University of Applied Sciences in Germany who has spent years working with artificial intelligence and language in positions such as Director of Research in R&D at Thomson Reuters.

Trained as a computational linguist, Leidner has now returned to academia, and part of his work will be to build models that are not just in English (or Mandarin).

“Now I’m living in Germany, and in particular in a Germany that is part of Europe. I’m of course interested in all European languages, including my native language, where a lot of the resources are not available at the same quality that they are for English. So there’s some catching up that needs to be done. And I want to dedicate some of my energy to that part,” he said.

The creation of underlying language models matters not just for the glory of having made something cutting edge. Large language models are the basis for other applications that tackle specific use cases, like Overtone’s algorithm that sorts news articles into groups based on their journalistic qualities. Without the underlying model, it is difficult to create technology that solves real-world problems for people who speak a certain language.

While Leidner’s target, German, the language of an estimated 2% of all internet users, already has some work behind it, there are other “low-resource languages” to which academics dedicate their research.

“That is very important. Because ultimately, you know, languages are equal, but of course, not really from an economical point of view. There’s English and Mandarin Chinese and they are spoken by most people on earth and everything else is sort of minor stuff in comparison,” Leidner said.

The world of smaller languages is fascinating and may yield unexpected insights. The professor remembered a team at the University of Sheffield tasked with building natural language processing components for Cebuano, a language of the southern Philippines. They applied an English part-of-speech tagger just to see how poorly it would perform, only to find that they were able to use it with very few changes.

But how do you approach building a language model from scratch? Languages are large, unwieldy creatures, and German includes everything from Kant to Kraftwerk. Leidner is currently at work securing text to feed into the first version of his German model and says the best approach is to take a “snapshot” of the language today.

“Instead of modeling the language, say German or English, you don’t need to model its history. You can take a synchronic approach, as linguistic scholars say, and look at the current snapshot. Now what this means for machine learning is that your model will be quite stale. So in a few hundred years you may need to redo the process.”

Even limiting the data to the language in use today still leaves questions of what to leave in, what to leave out, and a particularly tough one: how much data is enough?

Leidner agrees that there is always a need to keep collecting. However, he is also concerned about the growth of large language models, which can increase their performance on certain benchmark tests simply by becoming larger and larger.

“These models getting bigger and bigger is also something that immensely frustrates me because this is actually a bit of a waste of energy. And scientifically suboptimal because with Occam’s Razor you know that a smaller model that can do the same explanation is the better model,” Leidner said.

“We need to have alternative benchmarks where the nimbleness of models, that parsimonious property, is rewarded as well.”

It’s one of multiple problems that people have pointed out with the current approach to building models, from worries about overfitting based on test sets to concerns about climate change and the potential for new tech to perpetuate biases.

For Leidner’s upcoming German model, however, evaluating performance is still a few steps off. He is currently occupied with finding the right sources for his data. Maybe Overtone readers can help.

“If anybody listens in and has plenty of German text that you own, please get in touch,” he said.