Certainly has trained the most advanced Danish AI language model yet.
The Certainly BERT model has been fed with a staggering 1.6 billion Danish words and it is also available as open source.
Google’s BERT model is one of the most well-known machine learning models for text analysis.
The search giant has released English, Chinese and multilingual models. Now, Certainly is the first to release an extensive open-source Danish BERT model.
Academics from Copenhagen University and Alexandra Institute have stated that the language model performs better than models by Google, Zalando, and Stanford University.
The conclusions of the study can be found in the paper “DaNE: A Named Entity Resource for Danish”.
The code is downloadable for free from GitHub.
Certainly has built upon Google’s BERT model because many Danish private companies, institutions, or NGOs in need of AI in Danish could greatly benefit from it.
This makes Certainly one of the very few companies in Denmark to improve and support Danish AI by publishing open-source code.
This is no small thing. Firstly, this next step in the development of the Danish BERT model is highly useful for the whole Danish AI and machine learning ecosystem.
Secondly, it provides inspiration to the whole industry by democratizing AI and making new updates publicly available, for everyone.
“Certainly might be the single company in Denmark lifting the community with open source code,” said Jacob Knobel, CEO of the AI consultancy Datapult and a Forbes 30 Under 30 honoree in 2016.
“It is both important and inspirational for the industry,” he finished.
BERT is an acronym for “Bidirectional Encoder Representations from Transformers”. It is a deep neural network that is useful for Natural Language Processing (NLP).
The network has learned about Danish grammar and semantics by reading vast amounts of Danish text.
When working with AI language models, part of the challenge is to collect huge amounts of text.
This is needed to make an extensive model.
Certainly has managed to overcome the obstacle by turning the model into a massive bookworm.
Certainly’s Danish BERT model has read 1.6 billion words, equivalent to more than 30,000 novels.
Although this might sound like a lot, the model could have read even more, but it is difficult to find much more publicly available Danish text.
The general language understanding capabilities of the model are useful for text analysis pipelines.
The model reads texts and returns vectors, which are points in a coordinate system.
The shorter the distance between the points returned for two different texts, the closer their meanings are.
You can therefore use the vectors to figure out if different pieces of text are related.
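To make the idea of comparing vectors concrete, here is a minimal Python sketch. It does not call the Danish BERT model: the three-dimensional vectors and their labels are made up for illustration (real BERT vectors have hundreds of dimensions), and the helper names are our own. It simply shows how distance and cosine similarity could be used to decide which texts are related.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two points in the coordinate system."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Similarity of direction: close to 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors standing in for what a language model might return.
# The numbers are invented for this example.
vectors = {
    "food_review_a": [0.9, 0.1, 0.3],
    "food_review_b": [0.8, 0.2, 0.4],
    "train_delay_notice": [0.1, 0.9, 0.7],
}

related = euclidean_distance(vectors["food_review_a"], vectors["food_review_b"])
unrelated = euclidean_distance(vectors["food_review_a"], vectors["train_delay_notice"])

print(f"distance between the two food reviews: {related:.3f}")
print(f"distance between a food review and a train notice: {unrelated:.3f}")
```

In this toy data the two food reviews end up much closer to each other than either is to the train notice, which is the pattern you would look for with real model output.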
By combining the model’s general language understanding with, for example, data and knowledge of the positivity and negativity of the texts, the BERT model can help with sentiment analysis, entity extraction, and all the other disciplines in Natural Language Processing.
The Danish BERT model is useful for sentiment analysis in Danish.
For instance, it can analyze different prejudices in a text, identify the purpose and context of a text, and point out relevant words.
This is useful to multiple industries such as e-commerce, finance, the tech industry, and the public sector.
At Certainly, we believe that it is crucial for countries with ‘smaller’ languages, or less widespread languages, to secure themselves and make sure that they have a part in the global economy.
One of the ways of doing this is by using and taking advantage of the endless opportunities that come with Artificial Intelligence.
Furthermore, we think that it is important that it is not only up to the big international players to determine where, when and how Danish organisations can benefit from these technological achievements.
Otherwise, Denmark risks being left behind in the AI race.
“It’s vitally important for people in Denmark to have access to the benefits that language technology has brought to the English-speaking world,” said Leon Derczynski, PhD, an Associate Professor in Natural Language Processing at the IT University of Copenhagen.
“Seeing game-changing advances like Danish BERT come from the commercial sector, through Certainly, is a hugely positive sign.
“It clearly puts the company ahead of the curve in today’s Danish AI,” he finished.
Google has released a multilingual BERT model, but it is trained on more than a hundred different languages.
Danish text therefore makes up only 1% of the total amount of data.
The model has a vocabulary of 120,000 words, suffixes, and prefixes.
It divides rare words so that “Inconsequential”, for example, becomes “In-”, “-con-” and “-sequential”. This kind of word division occurs across all the different languages.
Google’s model therefore has room for about 1,200 Danish words. In comparison, Certainly’s model has a vocabulary of 32,000 Danish words.
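The word splitting and the vocabulary estimate above can both be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tokenizer shipped with either model: the toy vocabulary and the `split_into_pieces` helper are hypothetical, the splitting rule is a simple greedy longest-match, and the estimate assumes vocabulary slots are shared out in proportion to each language's share of the training data.

```python
def split_into_pieces(word, vocabulary):
    """Greedily split a word into the longest pieces found in the vocabulary."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocabulary:
            end -= 1
        if end == start:
            # No known piece starts here, so the word cannot be represented.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical toy vocabulary, invented for this example.
toy_vocabulary = {"in", "con", "sequential", "food"}
pieces = split_into_pieces("inconsequential", toy_vocabulary)
print(pieces)

# Rough estimate of the Danish share of the multilingual vocabulary,
# assuming slots are allocated in proportion to the amount of text.
multilingual_vocabulary_size = 120_000
danish_share_percent = 1
estimated_danish_entries = multilingual_vocabulary_size * danish_share_percent // 100

danish_model_vocabulary_size = 32_000
ratio = danish_model_vocabulary_size / estimated_danish_entries
print(estimated_danish_entries, round(ratio, 1))
```

Under these assumptions, the dedicated Danish vocabulary is more than 25 times larger than the estimated Danish portion of the multilingual one, so far fewer Danish words need to be broken into small pieces.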
During training, the model first reads a sentence, e.g. “I like Chinese food, especially spring rolls.”
Then, it hides some of the words from itself: “I like [HIDDEN] food, especially spring rolls.”
Next, it tries to guess the hidden word. If it guesses wrong, it adjusts so that it gets better the next time.
If it guesses correctly, then the model knows that it has understood the meaning of the text. In the example, the model learns that spring rolls are associated with Chinese cuisine.
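The hiding step described above can be sketched in Python. This is a simplified illustration, not the actual training code: the `hide_words` helper, the `[HIDDEN]` placeholder and the `hide_prob` value used here are our own assumptions for the example, and the guessing and adjusting is left out because it requires the neural network itself.

```python
import random

HIDDEN = "[HIDDEN]"  # placeholder used in this article; chosen for illustration

def hide_words(tokens, hide_prob=0.15, seed=0):
    """Hide each word with probability hide_prob and remember the answers."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    masked = []
    targets = {}  # position -> original word the model must guess
    for position, token in enumerate(tokens):
        if rng.random() < hide_prob:
            masked.append(HIDDEN)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

sentence = "I like Chinese food , especially spring rolls .".split()
masked, targets = hide_words(sentence, hide_prob=0.3, seed=1)

print(" ".join(masked))
print(targets)
```

During training, the model would only see `masked`, make a guess for every hidden position, and be corrected using the words stored in `targets`.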
Afterwards, the model would read the next sentence in the text, for example: “That’s why I often do my grocery shopping in the Asian supermarket”.
The model also reads a random sentence from another book: “At 7 o’clock, Jane Doe ate dinner”.
The model then tries to figure out which of the two sentences logically follows the first sentence, “I like Chinese food, especially spring rolls.”
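The next-sentence exercise can be sketched as a small Python function that builds one training example. This is a hedged illustration, not the real data pipeline: the `make_sentence_pair` helper and the 50/50 choice between the true next sentence and a random one are assumptions made for this example.

```python
import random

def make_sentence_pair(document, index, other_sentences, rng):
    """Pair a sentence with either its true successor or a random sentence.

    Returns (first_sentence, second_sentence, is_next), where is_next is the
    answer the model is asked to predict.
    """
    first = document[index]
    if rng.random() < 0.5:
        return first, document[index + 1], True
    return first, rng.choice(other_sentences), False

document = [
    "I like Chinese food, especially spring rolls.",
    "That's why I often do my grocery shopping in the Asian supermarket.",
]
other_book = ["At 7 o'clock, Jane Doe ate dinner."]

rng = random.Random(42)  # fixed seed so the example is repeatable
examples = [make_sentence_pair(document, 0, other_book, rng) for _ in range(4)]

for first, second, is_next in examples:
    print(is_next, "|", second)
```

Each example gives the model two sentences and a yes/no answer, so it can be corrected whenever it mistakes the random sentence for the real continuation.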
In line with our mission at Certainly to develop and make Danish AI publicly available, it only made sense that the Danish Certainly BERT model would be open source.
This means that others can further develop it and use it to improve their products and services as well as produce new solutions.
The model and the instructions for data scientists and engineers are available for free on GitHub. We hope that you will support Danish AI by sharing the link in your organization.
If your organization needs something industry-specific and you don’t have the time, ability, or resources to build it yourself, we can set it up for you on our platform.
Just get in touch with us at hello@certainly.io.
Follow our blog to keep an eye on the latest AI news, Conversational Chatbot best practices, and much more.
Since this blog post was published, Certainly has completed work on BERT models for Swedish and Norwegian, with more to follow.