Skrible
Writing is an important part of our education, and a skill we rely on for a lifetime. When learning to write, children discover how to put their thoughts into words, and they learn to organise their ideas, be creative and think critically. When we were asked to build a new tool that combines machine learning and writing, we were thrilled.
Skrible easily connects teachers and students, providing an accessible and understandable framework for teachers to customise their teaching. While we’ve already touched upon how Skrible works, and how it aims to personalise learning, this article explains the technical background of Skrible, and how we used natural language processing (NLP) to facilitate the learning process.
Making assignments smoother and faster
One of the goals of Skrible is to streamline the way teachers and students keep track of criteria, assignments, deadlines, and feedback. We started by looking at the current workflow for teachers and focused on how they create and distribute their assignments.
During the first interviews we learned that teachers share and collect assignments in very different ways, and that they receive a myriad of documents in different file formats in return. With Skrible, we wanted to eliminate these tedious and time-consuming steps by giving schools a tool that makes it easy to share assignments, texts and feedback.
Machine learning to help human teaching
When the teacher creates an assignment, they select evaluation criteria from a wide variety of options. These criteria form the basis for evaluating students transparently and help the teacher keep track of each student's progress (a minimal sketch of two such checks follows the list):
Title. Reminds the student to assign a catchy title to each text.
Word count. Teachers can set a minimum and a maximum number of words they want the student to write. Skrible notifies students if their text falls outside that range.
Spellcheck. Checks spelling in both Norwegian Nynorsk and Bokmål.
Space count. Highlights incorrect use of spaces.
Capital letters. Shows correct capitalisation of proper nouns, and lets you know when there’s no capital letter at the start of a sentence.
Terms. Defines a list of words the students are supposed to use.
Repetition. Shows when a student repeats the same word too often.
Long sentence. Highlights lengthy sentences.
Paragraph. Shows if a paragraph is too long.
Text structure. Reminds the student to structure the text correctly.
Commas. Corrects the usage of a comma before “but”.
Full stop. Marks full stops in the title as errors.
Question marks. Reminds the student that a question needs a question mark at the end.
Exclamation marks. Flags exaggerated use of exclamation marks.
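To give a feeling for how such criteria translate into code, here is a minimal sketch of two of the rule-based checks (the function names, question-word list and threshold are our own illustration, not Skrible's actual implementation):

def check_exclamation_marks(text, max_marks=3):
    # Flag exaggerated use of exclamation marks
    return text.count("!") > max_marks

def check_question_marks(sentences):
    # A sentence starting with a question word should end with "?"
    question_words = {"hva", "hvem", "hvorfor", "hvordan", "kva", "kven", "kvifor", "korleis"}
    flagged = []
    for sentence in sentences:
        words = sentence.strip().split()
        if words and words[0].lower() in question_words and not sentence.strip().endswith("?"):
            flagged.append(sentence)
    return flagged

print(check_exclamation_marks("Så gøy!!!!"))  # True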
Each of the criteria mentioned above can be analysed with the help of modern language technology techniques that simulate the way humans process language. We decided to make extensive use of the excellent Python NLP library spaCy.
Each NLP task starts with tokenisation, Part of Speech (PoS) tagging and dependency parsing of the text. These steps provide the foundation for all subsequent analyses. Simply put, tokenisation is the process of splitting strings of text into meaningful words. PoS tags assign each word to its syntactic class (noun, verb, etc.), while dependency parsing defines the relationships between the words and thereby represents the sentence's grammatical structure.
Luckily, spaCy provides a pre-trained language model for Bokmål, the most widely used written standard of Norwegian. But for Nynorsk, the second written standard, we had to train our own model based on the Nynorsk Universal Dependencies treebank. Head over to GitHub if you want to explore this further.
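To illustrate what this pipeline produces, here is how the pre-trained Bokmål model (nb_core_news_sm, installable via python3 -m spacy download nb_core_news_sm) can be queried; the example sentence is our own:

import spacy

# Load spaCy's pre-trained Norwegian Bokmål model
nlp = spacy.load("nb_core_news_sm")

doc = nlp("Elevene skriver en tekst om sommerferien.")
for token in doc:
    # Print each token with its PoS tag, dependency label and syntactic head
    print(token.text, token.pos_, token.dep_, token.head.text)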
How to train a spaCy Nynorsk language model
Training models in spaCy is a breeze. After the usual data preprocessing, one can simply train a blank model via the command line with spaCy's training script. In this setting, we also chose to improve accuracy by leveraging transfer learning and adding fastText's word vectors:
# Download Nynorsk fastText vectors & create spaCy model
wget -P fasttext https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.nn.300.vec.gz
python3 -m spacy init-model nn models/nn_vectors_ft_lg --vectors-loc fasttext/cc.nn.300.vec.gz
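The "usual data preprocessing" mentioned above mainly means converting the treebank's CoNLL-U files into spaCy's JSON training format; assuming the Nynorsk UD files have been downloaded into data/, that step could look like this:

# Convert Universal Dependencies CoNLL-U files to spaCy's JSON training format
python3 -m spacy convert data/nynorsk-ud-train.conllu data/ --converter conllu
python3 -m spacy convert data/nynorsk-ud-test.conllu data/ --converter conllu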
And actual training:
python3 -m spacy train nn --version=0.0.1 --vectors=models/nn_vectors_ft_lg models/nn_ud_fasttext_md data/nynorsk-ud-train.json data/nynorsk-ud-test.json
Voilà — we now have a Nynorsk model with tokenisation, Part of Speech tagging and dependency parsing ready to use for our text analyser.
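As a quick sanity check, the freshly trained model can be loaded straight from the training output directory (spaCy's training script stores the best-scoring iteration in a model-best subfolder):

import spacy

# Load the Nynorsk model trained above
nlp = spacy.load("models/nn_ud_fasttext_md/model-best")

doc = nlp("Elevane skriv ein tekst om sommarferien.")
for token in doc:
    print(token.text, token.pos_, token.dep_)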
Looking deeper into Norwegian spelling correction
Traditional spelling correction algorithms rely heavily on dictionaries to find the correct words.
There used to be a "spell-Norwegian" project, an open-source effort to build Norwegian dictionaries readable by all major open-source software, for both Bokmål and Nynorsk, but sadly it is no longer maintained or developed. The Norwegian language packs used by Firefox and Chrome are based on these dictionaries, but their spelling is outdated. So we took matters into our own hands and compiled our own updated version.
The National Library of Norway is a great resource for Norwegian text corpora, annotated texts and word lists, such as Norsk Ordbank, which reflects the official standard orthography of both Nynorsk and Bokmål. We took the most recent word lists and enriched the open-source dictionaries to bring them up to date. We then leveraged Hunspell, a free spell checker and morphological analyser library, to check for correct spellings.
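In code, checking a word against these enriched dictionaries can look like the following (using the pyhunspell Python bindings; the dictionary paths are illustrative):

import hunspell

# Point Hunspell at the enriched Bokmål dictionary and affix files
checker = hunspell.HunSpell("dictionaries/nb_NO.dic", "dictionaries/nb_NO.aff")

print(checker.spell("skole"))     # True: correctly spelled
print(checker.spell("skoole"))    # False: misspelled
print(checker.suggest("skoole"))  # suggestions, e.g. ['skole', ...]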
The methods described so far don't deal well with correct words that are not in the dictionary. To solve this, we found another way of enriching the data: predefined entities. Entities are words or phrases that stand (fairly) consistently for some referent; an entity can refer to a person, a city, a building or a country. Compiling a list of known entities enabled us to further improve the spelling correction within Skrible.
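Conceptually, the entity list simply acts as a whitelist in front of the dictionary lookup; a simplified sketch (the entity set and function name are our own illustration):

# Treat known entities as always correct, then fall back to Hunspell
KNOWN_ENTITIES = {"Skrible", "Bærum", "Kommuneforlaget"}

def is_spelled_correctly(word, checker):
    return word in KNOWN_ENTITIES or checker.spell(word)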
We also wanted to cover another common source of errors: compound detection. Germanic languages, like Indo-Iranian languages such as Sanskrit, allow for (potentially) infinite construction of compound words. By compiling a lightweight language model that relies on bigrams to detect possible compounds, we made it possible to spot these errors as well.
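A rough sketch of the idea: if two adjacent words appear far more often joined together than side by side in a reference corpus, the split form is probably a compound-word error (the counts and threshold below are made up for the example):

def is_split_compound(first, second, unigram_counts, bigram_counts, ratio=5.0):
    # Compare how often the words occur joined versus as a bigram
    joined = unigram_counts.get(first + second, 0)
    apart = bigram_counts.get((first, second), 1)
    return joined / apart > ratio

unigram_counts = {"fotballbane": 500}       # "football pitch", written correctly
bigram_counts = {("fotball", "bane"): 20}   # the same compound, incorrectly split
print(is_split_compound("fotball", "bane", unigram_counts, bigram_counts))  # True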
Learning to learn
Skrible offers more features than those mentioned above. Teachers can easily provide feedback via the review functionality and communicate with students by adding comments. Together with our clients KF (Kommuneforlaget) and NTB Arkitekst, we aim to support students and teachers in mastering understanding, writing and creativity.
In the first month after launching in 2020, 3200 students at 44 schools in Bærum municipality had already started to use Skrible. Do you want to try the new writing tool? Feel free to contact the Skrible team!