After successfully releasing Danish and Norwegian BERT models, Certainly is ready with a model for Swedish, the Scandinavian language with the most speakers.
Certainly’s Swedish BERT model has been trained on a staggering 25 GB of raw text data. This is more than ten times more data than the previous largest Swedish BERT model.
As with the Danish and Norwegian models, it’s freely downloadable from Certainly’s GitHub profile here: Nordic BERT.
Certainly’s ambition is that the model will contribute to the Swedish Natural Language Processing community in the same manner that the Danish and Norwegian BERT models have been contributing in Denmark and Norway.
The hope is that Swedish data scientists will share their findings.
Swedish has a different set of characters than Danish and Norwegian. Apart from the usual English letters, Swedish uses the vowels Å, Ä, and Ö.
More than 10 million people speak Swedish, almost as many as Danish and Norwegian combined. Certainly is in the process of running a more in-depth analysis of the data for the different languages.
They believe that careful analysis will allow them to improve the quality of the training data.
As an example of Certainly’s current findings, consider the peculiarity of Lsjbot, a Swedish Wikipedia bot that skews automatically gathered datasets in Swedish.
Lsjbot is an automated article-creation program which mostly writes articles for Swedish Wikipedia. Take a look at the following chart of Wikipedia articles in different languages:
Do you notice something strange? Despite only having 16 million speakers in the southern part of the Philippines, Cebuano is the second most popular language.
The third most represented language by article count is Swedish. So what’s going on?
It turns out that most of the Swedish articles were contributed by the Swedish physicist Sverker Johansson.
Or rather, Lsjbot, an automatic robot that Sverker Johansson created. The robot reads data from a database and automatically writes and publishes articles.
The articles are automatically generated, so they all incorporate similar expressions.
For example, an article might display information about an animal, and feature the sentence “The average adult [Giraffe] is [4.6m-6.1m] tall and weighs [800kg]. Its diet mainly consists of [leaves, seeds, and fruit]”.
Lsjbot then replaces the variables within the sentence for different animals.
While this might be great for filling out a Wikipedia page, it isn’t beneficial for training Natural Language Processing algorithms.
The algorithm favors particular expressions because there is a high volume of similar sentences.
In turn, it skews the model and impacts performance negatively.
As to why is Cebuano the second most represented language, here’s a hint: Sverker Johansson’s wife is from the Philippines…
Besides training a Finnish BERT model, Certainly is going to work on running a more detailed analysis of the data for different languages.
By publishing high-quality data sets, Certainly hope to get data scientists from all over Europe to contribute to their efforts in improving Natural Language Processing for all European languages.
Article written by: Jens Dahl Møllerhøj