The language you use on the web is like your fingerprint or your gait. Machine learning and behavioral analysis algorithms will use it to trace you everywhere.
In January, the era of cheating with impunity at Polish universities came to an end. It was ended by a dozen programmers from the Natural Language Processing Laboratory at the National Information Processing Institute – State Research Institute (OPI PIB), who designed the Uniform Anti-plagiarism System (JSA). Since the beginning of the year, the system, financed by the Ministry of Science and Higher Education, has made it possible to determine the degree to which a thesis was authored by whoever signed it. In other words: to detect plagiarism.
How does the Uniform Anti-plagiarism System work?
10 billion bites
More or less like this: we divide the text into shorter pieces of five, ten, or twenty sentences, so-called bites, and then we look for similarities between them and pieces of analogous length that we have in the database. There is certainly material to compare them to: we have gathered over 10 billion such microdocuments – explains Marek Kozłowski, Ph.D., the Laboratory Manager.
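The splitting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JSA implementation: the function name `split_into_bites`, the naive sentence splitter, and the choice of non-overlapping windows are all hypothetical.

```python
import re

def split_into_bites(text, size=5):
    """Split text into consecutive 'bites' of `size` sentences each.

    Hypothetical helper: sentences are split naively after '.', '!' or '?'
    followed by whitespace, which is much cruder than a real tokenizer.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]

sample = "One. Two! Three? Four. Five. Six. Seven."
bites = split_into_bites(sample, size=3)
```

Running the same function with sizes of five, ten, and twenty would yield the three granularities mentioned in the quote; the last bite is simply shorter when the sentences do not divide evenly.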
These billions of bites originate from ten enormous databases, such as the National Repository of Theses (over three million), the Polish semantic search engine NEKST (900 million documents from the Polish Internet), six Wikipedia language versions (including Polish), legal act databases, and open-access articles collected on an ongoing basis.
The sixth sense of the system
The system cannot be cheated by, for example, changing the word order or swapping one word for another. The Uniform Anti-plagiarism System breaks the text down into individual words and builds unordered collections out of them. Only those collections are compared with the source texts.
Marek Kozłowski, Ph.D.
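Why reordering words does not help can be shown with a small bag-of-words comparison. The sketch below assumes a simple overlap measure over word counts; the article does not say which measure JSA uses, and the function names are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its words, discarding their order."""
    return Counter(re.findall(r"\w+", text.lower()))

def overlap(a, b):
    """Share of word occurrences two bags have in common (0.0 to 1.0)."""
    shared = sum((a & b).values())  # multiset intersection
    total = max(sum(a.values()), sum(b.values()))
    return shared / total if total else 0.0

source = bag_of_words("The system cannot be cheated by changing the word order")
shuffled = bag_of_words("By changing the word order the system cannot be cheated")
swapped = bag_of_words("The system cannot be fooled by changing the word order")
unrelated = bag_of_words("Neural networks are black boxes")

same_score = overlap(source, shuffled)
swapped_score = overlap(source, swapped)
other_score = overlap(source, unrelated)
```

Because order is thrown away, the shuffled sentence is indistinguishable from the source, and a single swapped word lowers the score only slightly.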
But that’s not the end. The sixth sense that makes the system even more precise is a stylometric study, or, put differently, a stylometric behavioral profile. Simply put, it is the study of the style in which a text was written, and of what that style reveals about the author.
We have no data about the author’s style other than the thesis we are currently analyzing. We can find fragments that deviate from the overall style of the thesis, assuming, of course, that the thesis has a dominant style – says Dr. Kozłowski.
The fragments deviating from the norm are highlighted.
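One way to picture this in-document check is to measure a single style feature per fragment and flag the outliers. The sketch below assumes average sentence length as the only feature and a z-score cut-off of 1.5; both are illustrative choices, not details given in the article.

```python
import re
from statistics import mean, pstdev

def avg_sentence_length(fragment):
    """Average number of words per sentence in a fragment."""
    sentences = [s for s in re.split(r"[.!?]+", fragment) if s.strip()]
    return mean([len(s.split()) for s in sentences])

def deviating_fragments(fragments, z_threshold=1.5):
    """Indices of fragments whose feature lies far from the document mean."""
    values = [avg_sentence_length(f) for f in fragments]
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        return []  # every fragment looks the same, nothing to flag
    return [i for i, v in enumerate(values) if abs(v - centre) / spread > z_threshold]

fragments = [
    "I ran the test. It passed. We moved on.",
    "The data was clean. We checked it twice. Nothing was off.",
    "We wrote it up. The team agreed. It was done.",
    "Notwithstanding the aforementioned methodological considerations which were "
    "discussed at considerable length in the preceding chapter, the results "
    "remain broadly consistent.",
    "We ran it again. It held. We stopped there.",
]
flagged = deviating_fragments(fragments)
```

Here only the fourth fragment, with its single twenty-word sentence, stands out from the otherwise terse text. A realistic profile would combine many features rather than one.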
Fingerprint in every text
How does stylometric behavioral profiling work?
Throughout our lives, each of us develops a unique writing style. That is why, as readers, we hear the whisper of intuition that certain texts have been written by the same person, or that a text has been authored by someone other than the person whose name appears at the bottom. After all, we see what kind of phrases someone uses, how they form sentences, how they punctuate, etc.
We do this intuitively when we read a text – explains Kozłowski. A machine needs input data on the basis of which it can compare texts over time. It needs a set of characteristics from which vectors (ordered sets of characteristics) are created. They describe a particular person’s profile, and subsequent texts are then compared with this profile. This is what we call stylometric behavioral profiling. We create a behavioral profile, one pertaining to a person’s behavior. Specifically, how this person uses the written language.
What traits are we talking about? For example, some people prefer shorter sentences (more full stops and capital letters will then appear in the text), while others prefer longer, more complex ones (they use more commas, and fewer full stops and capital letters). Another trait is a preference for particular parts of speech: some like nouns, while others prefer verbs in their texts. Pronouns are significant too. Some authors will use I more often, others you, and still others we.
The mirror of your thoughts
Prepositions can also be analyzed (everyone has their favorites), as can the frequency of adjectives or participle nouns.
These are all traits that describe how we form our thoughts. The algorithm builds sets of these traits – adds Kozłowski.
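The traits listed above can be turned into an ordered vector of numbers. The following is a minimal sketch with a handful of hand-picked features; the function name `style_vector`, the feature set, and the pronoun list are assumptions for illustration, and a real system would use far more features.

```python
import re

PRONOUNS = ("i", "you", "we")  # illustrative, English-only list

def style_vector(text):
    """Turn a text into an ordered list of simple style features."""
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty input
    features = [
        len(words) / (len(sentences) or 1),  # average sentence length in words
        text.count(",") / n_words,           # commas per word
    ]
    # share of each tracked pronoun among all words
    features += [words.count(p) / n_words for p in PRONOUNS]
    return features

terse = style_vector("I came. I saw. I won.")
```

For this short sample the vector records two-word sentences, no commas, and a heavy use of the pronoun I.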
When enough text written by a single person has been gathered, the machine can learn to recognize that person’s individual style of expression. We all form the basic parts of our style when we are between 14 and 16 years old. The style solidifies around the time we graduate from high school or during the first year at university. Not much changes after that, and then only if a person is consciously working on their writing style (as writers and journalists do). According to psychologists, people become more fluent in writing after the age of 20 and their vocabulary range increases, but the style remains essentially the same.
Changes do occur, of course. Statistically speaking, they are insignificant – remarks Kozłowski. – We’re talking ten percent variation, at the most, in one direction or another.
Danish ghostwriter
A ghostwriter is someone who is paid to write for others and who agrees, once the job is done, that whoever paid for the writing will be known as the author. Ghostwriter is also the name of a program based on artificial intelligence that scientists from the IT University of Copenhagen have been using since spring to help combat ghostwriting among Danish students. The phenomenon has become a plague in recent years. There is an official website, Den Blå Avis, that serves as a marketplace of written work for students.
The algorithm developed by the computer scientists from Copenhagen uses behavioral profiling and analyzes every thesis for linguistic similarity to previous works by the same author. Its database contains 130 thousand theses written by ten thousand students.
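Comparing a new text against an author's earlier work can be sketched as a similarity between profiles. The example below is not the Copenhagen algorithm: it assumes a tiny list of function words, averages the profiles of earlier texts, and uses cosine similarity, all of which are illustrative choices.

```python
import math
import re

FUNCTION_WORDS = ("the", "of", "and", "in", "i", "we", "you")  # illustrative list

def profile(text):
    """Relative frequency of each tracked function word in the text."""
    words = re.findall(r"\w+", text.lower())
    total = len(words) or 1
    return [words.count(w) / total for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of several profiles."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

earlier_works = [
    "We measured the effect and we report the result in the table.",
    "In the second study we repeated the test and we saw the same effect.",
]
author = centroid([profile(t) for t in earlier_works])

consistent = cosine(author, profile("We ran the analysis and we describe the outcome in the appendix."))
suspicious = cosine(author, profile("I think you know I did what you asked of you."))
```

A text that uses the same small words in the same proportions scores close to 1, while one built on entirely different pronouns scores far lower. With real theses the differences are much subtler, which is why a large collection of earlier texts matters.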
The creators of Ghostwriter claim their program can also be used to detect document forgeries, a task that until now required human intuition or arduous scrutiny by experts.
The only weak spot
But behavioral profiling has its weak spot too. For example, people who are depressed change their language style, and this is not about expressing gloom or sadness. In depression, the whole outlook on the world changes.
As research shows, depressed individuals are less prone to use third-person pronouns (he, she, they) and use first-person pronouns (I, me) much more frequently instead. Their world becomes monochromatic, devoid of finer shades and nuances, so they are more prone to use words such as "always," "nothing," "so-so," and "entirely," that is, words that categorize in absolute terms.
The algorithm can detect this as well, indicating that we may be depressed, and then take this fact into account when analyzing our text.
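Counting such markers is simple in principle. The sketch below measures how often first-person pronouns and absolutist words occur in a text; the word lists are taken from the examples above and are deliberately tiny, the function name is hypothetical, and nothing here is a diagnostic tool.

```python
import re

FIRST_PERSON = {"i", "me"}                       # examples named in the text
ABSOLUTIST = {"always", "nothing", "entirely"}   # examples named in the text

def marker_rates(text):
    """Share of first-person and absolutist words among all words."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    return {
        "first_person": sum(w in FIRST_PERSON for w in words) / total,
        "absolutist": sum(w in ABSOLUTIST for w in words) / total,
    }

rates = marker_rates("I always fail. Nothing helps me.")
```

A profiling system could compare these rates with the author's usual values and treat a marked shift as a reason not to conclude that someone else wrote the text.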
The last sieve and alerts
The last sieve of the Uniform Anti-plagiarism System uncovers evident manipulations that were supposed to hide borrowings. It consists of so-called "blank character detectors," which catch text that an author has rendered white to make it invisible, and detectors of microspaces (spaces between words deleted to create clusters, e.g. "thiswouldbealie").
At the end, the system establishes the value of alerts: warnings and alarms. Every university and department can adjust these parameters to its own standard. By default, the first threshold is set to 40 percent and the second to 70 percent probability that a thesis is similar to other materials.
Our research shows this is the optimal range – says Kozłowski. Theses cite scientific sources in various ways, using various methodologies, which is why the JSA's evaluation is not the final verdict. Our system is meant to support the supervisor, and it is up to the supervisor to decide whether a given thesis has been plagiarized or whether the borrowings are justified.
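The cluster trick can be caught with a very simple heuristic. The sketch below flags unusually long alphabetic tokens as candidates for glued-together words; the function name and the 12-character limit are assumptions, and long legitimate words would be flagged too. Detecting white text would require access to the document's formatting, so it is not shown.

```python
import re

def glued_clusters(text, max_len=12):
    """Return purely alphabetic tokens longer than max_len characters.

    Hypothetical heuristic: long tokens are only candidates for words
    glued together; a dictionary check would be needed to confirm them.
    """
    tokens = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    return [t for t in tokens if len(t) > max_len]

found = glued_clusters("The author wrote thiswouldbealie in the middle.")
```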
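The two-level alert can be expressed as a small function. This sketch uses the default thresholds quoted above and lets them be overridden, as a university or department might; the function name, the labels, and the treatment of values exactly on a threshold are assumptions.

```python
def alert_level(similarity, warning=0.40, alarm=0.70):
    """Map a similarity share (0.0 to 1.0) to an alert label.

    Assumption: a value exactly on a threshold counts as reaching it.
    """
    if similarity >= alarm:
        return "alarm"
    if similarity >= warning:
        return "warning"
    return "none"

levels = [alert_level(s) for s in (0.25, 0.55, 0.70)]
stricter = alert_level(0.35, warning=0.30, alarm=0.60)
```

Whatever the label, the result is only a signal for the supervisor, who makes the final call.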
250 seconds, 100 servers, 40 terabytes
As a result of the amendment of the Law on Higher Education and Science, every written thesis has had to be put through the JSA since the beginning of 2019. Over 350 Polish universities have already used the system. Until last year, before the JSA was introduced, universities used various systems, such as Genuino, OSA, or Plagiat.pl. Their comparison range was limited to theses available at a given university, or at federations of universities using the same system. It was possible, then, to plagiarize a thesis from another university.
Analyzing a thesis for borrowings takes 250 seconds on average. The analysis is done by a cluster of 100 servers that manages over 40 terabytes of data. The system has detected a suspiciously large number of borrowings from other sources in almost 10 thousand cases.
Around eight percent of theses surpassed the warning threshold, which tells us that over 40 percent of the text is similar to others in the database. A further 2.5 percent surpassed the alarm threshold; in their case, a similarity of 70 percent or greater was detected – says Kozłowski.
Likes: the whole truth about you
Notably, the traces we leave on the Internet that can be used to find us are not only words. You do not have to write anything. Clicking is enough. And it is not just about Internet cookies, on the basis of which algorithms create consumer profiles.
Six years ago, scientists from Stanford and Cambridge Universities showed that an algorithm can guess a person's gender and political views on the basis of Facebook likes alone. In two cases out of three, the algorithm accurately predicted marital status, in four out of five religion, and in nine out of ten sexual orientation (65, 81, and 88 percent accuracy, respectively).
Your likes are like an exfoliating epidermis. On their basis, intelligent algorithms can determine what psychologists call the "Big Five" personality traits.
In 2015, a team of Cambridge scientists, including the Polish scientist Michał Kosiński, Ph.D., showed that the traces you leave on the Internet make it possible to determine your personality type. Just ten likes are enough for the algorithm to identify it more accurately than your co-workers can, and a few dozen to do so more accurately than your flatmate. With 150 likes the computer knows you better than a family member does, and with 300, better than your partner.
Playlist: are you intelligent?
By analyzing your Facebook likes, intelligent algorithms can determine what psychologists call the Big Five personality traits, i.e. the strength or weakness of neuroticism, extroversion, openness to experience, agreeableness and conscientiousness.
People don’t realize how much can be learned from a Facebook profile alone, or from what can be found on Spotify or YouTube. You think to yourself: at most, someone will know what I listen to. As it turns out, based on a Spotify playlist, one can glean a lot of additional data describing personality, intelligence, religion, political views, sexual orientation – Michał Kosiński commented on the research results.
By arranging an appropriate combination of these traits, it’s possible to zoom in on you in the sea of millions.
Black box: it knows how, but it doesn’t understand
The fact that algorithms can recognize our traces on the Internet does not mean at all that they understand any of it.
Machines are perfect at generalizing. They can read an author’s stylometric fingerprint from a text, and they can recognize what is in a picture and accurately describe it. But they can’t draw logical conclusions or explain how they arrived at these conclusions. We give them a picture and, snap, they will describe it: "a cat and a dog are skateboarding." It all works really nicely, because the algorithm taught itself to recognize dogs, cats, and skateboards from millions of examples. But it has no idea that the dog and the cat are alive, or that the skateboard is something humans ride on – clarifies Marek Kozłowski.
Neural networks are our black boxes. They perform certain tasks for us, but we have no idea how they do it. Kozłowski gives the example of an algorithm created in the laboratory he manages: based on the comments of Internet users, it predicted the outcome of the recent parliamentary elections. How is that possible, when so many older voters aren’t active on the Internet? We don’t know.
Cooperation: Michał Skubik