Analysis Computer scientists have detailed ways in which AI language systems – including some in production – can be hoodwinked into making bad decisions by text containing unseen Unicode characters.
Account numbers can be switched around, recipients of transactions changed, and comment moderation bypassed by special hidden characters, we’re told. And it is claimed software built by Microsoft, Google, IBM, and Facebook can be potentially fooled by carefully crafted Unicode.
The issue is that ambiguity or discrepancies can be introduced if the machine-learning software ignores certain invisible Unicode characters. What’s seen on screen or printed out, for instance, won’t match up with what the neural network saw and made a decision on. It may be possible to abuse this lack of Unicode awareness for nefarious purposes.
As an example, you can get Google Translate’s web interface to turn what looks like the English sentence “Send money to account 4321” into the French “Envoyer de l’argent sur le compte 1234.”
This is done by entering on the English side “Send money to account” and then inserting the invisible Unicode character U+202E, the right-to-left override, which reverses the displayed order of the text we type next, so “1234” appears as “4321.” The translation engine ignores the special Unicode character, so on the French side we see “1234,” while the browser obeys the character, so it displays “4321” on the English side.
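To make the mismatch concrete, here is a minimal Python sketch of the string described above. It assumes the input is exactly the English phrase followed by U+202E and the digits; the helper names find_bidi_controls and strip_bidi_controls are ours for illustration and are not taken from the paper or from any Google API.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE

# Logical (stored) order: the digits are still "1234".
text = "Send money to account " + RLO + "1234"

# Characters that steer bidirectional display rather than carry content.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # embeddings, overrides, pop
    "\u2066", "\u2067", "\u2068", "\u2069",            # isolates
    "\u200e", "\u200f", "\u061c",                      # directional marks
}

def find_bidi_controls(s):
    """Return (index, Unicode name) for every bidi control character in s."""
    return [(i, unicodedata.name(ch)) for i, ch in enumerate(s) if ch in BIDI_CONTROLS]

def strip_bidi_controls(s):
    """Return s with bidi control characters removed."""
    return "".join(ch for ch in s if ch not in BIDI_CONTROLS)

print(find_bidi_controls(text))
print(strip_bidi_controls(text))
```

Software that drops the override sees the digits in stored order, 1234, whereas a renderer that honors it shows them reversed. Flagging the control character before the text reaches a model is one way to surface that gap.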
It may be possible to exploit an AI assistant or a web app using this method to commit fraud, though we present it here in Google Translate to merely illustrate the effect of hidden Unicode characters. A more practical example would be feeding the sentence…
…into a comment moderation system, where U+8 is the invisible Unicode control character for backspace, meaning delete the previous character. The moderation system ignores the backspace characters, sees instead a string of misspelled words, and can’t detect any toxicity – whereas browsers correctly rendering the comment show, “You are a coward and a fool.”
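The two readings of such a comment can be sketched in Python. The input string below is a made-up example of ours, not the one used in the paper, and render_with_backspaces simply models the delete-previous-character behavior described here; it is not a claim about how any particular browser or classifier is implemented.

```python
BACKSPACE = "\u0008"

def render_with_backspaces(s):
    """Model a renderer that applies each backspace by deleting the previous character."""
    out = []
    for ch in s:
        if ch == BACKSPACE:
            if out:
                out.pop()
        else:
            out.append(ch)
    return "".join(out)

def ignore_backspaces(s):
    """Model a pipeline that silently drops backspaces and keeps everything else."""
    return s.replace(BACKSPACE, "")

# Hypothetical input: junk letters, each followed by a backspace.
comment = "You are a c" + "x" + BACKSPACE + "owa" + "z" + BACKSPACE + "rd"

print(render_with_backspaces(comment))  # what the reader would see
print(ignore_backspaces(comment))       # what the model would see
```

Under those assumptions the reader sees “You are a coward” while the model is handed “You are a cxowazrd”, a misspelling it has little reason to flag.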
Thus, by hiding Unicode characters in your message or post, you’re able to trash-talk someone without setting off the moderation system. This has been demonstrated, to varying degrees, against IBM’s Toxic Content Classifier and Google’s Perspective API.
Crucially, however, these Unicode shenanigans abuse machine-learning systems’ handling of input text rather than exploiting weaknesses within the depths of a neural network.
Our attacks work against currently deployed commercial systems
It was academics at the University of Cambridge in England, and the University of Toronto in Canada, who highlighted these issues, laying out their findings in a paper released on arXiv in June this year.
“We find that with a single imperceptible encoding injection – representing one invisible character, homoglyph, reordering, or deletion – an attacker can significantly reduce the performance of vulnerable models, and with three injections most models can be functionally broken,” the paper’s abstract reads.
“Our attacks work against currently deployed commercial systems, including those produced by Microsoft and Google, in addition to open source models published by Facebook and IBM.”
A homoglyph adversarial attack that is easy to perform in Google Translate involves switching the first letter of the English alphabet, a, to the Cyrillic а in a word. They look the same to the human eye, though their Unicode code points are different.
Using the English letter a in the word “paypal” and translating it into Russian in Google Translate gives you the correct translation “PayPal”, but replace the first occurrence of a with the Cyrillic а, and Google will spit out “папа”, meaning dad or father. It may thus be possible to exploit this in an AI assistant or web app to redirect payments and suchlike.
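A short Python sketch shows why the two spellings are different strings to software even though they look alike on screen. The describe helper is our own name for illustration; it relies only on the standard unicodedata module.

```python
import unicodedata

latin = "paypal"
spoofed = "p\u0430ypal"  # first "a" swapped for CYRILLIC SMALL LETTER A

def describe(s):
    """List the code point and Unicode name of every character in s."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]

print(latin == spoofed)      # the strings differ despite looking the same
print(describe(latin)[1])    # second character of the genuine word
print(describe(spoofed)[1])  # second character of the spoofed word
```

The second character is U+0061 in the genuine word and U+0430 in the spoofed one, so any system keyed on exact code points treats them as unrelated tokens.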
Screenshot of Google Translate mistaking the English word paypal for papa in Russian due to a homoglyph attack
Spam emails may be able to evade detection, and hate speech may be able to slip through moderation, if miscreants use these techniques, Nicolas Papernot, co-author of the paper and an AI security researcher at the University of Toronto’s Vector Institute, told El Reg. Papernot referred to these text-based Unicode attacks as “bad characters.”
“The attacks presented in our paper are applicable to real-world applications; as part of our responsible disclosure, a major mail provider made changes to their spam filters and a cloud provider modified their machine-learning-as-a-service offering,” Papernot told us.
“Bad characters [are applicable] everywhere machine learning is used for natural language processing – examples of such systems are toxic content detection, topic extraction, and machine translation. Bad characters are also agnostic to machine learning tasks and pipelines – they exploit discrepancies between visual and logical representation of characters rather than inconsistencies specific to a given model as was targeted by prior work on adversarial examples.
“This makes bad characters more practical to use.”
It may even be possible to use invisible Unicode for good as well as bad, he added.
“When machine learning is used for questionable purposes, such as censorship, bad characters could be leveraged by human rights activists to evade censorship,” Papernot told us.
“In another example, law firms that rely on natural language processing to process large corpus of documents efficiently are also exposed: a malicious entity could submit documents with bad characters to evade scrutiny from the law firm.”
Developers of AI-powered software should either filter out special Unicode characters – such as backspaces – entirely, if feasible, or pass the Unicode through a parser before it’s given to a neural network. The aim is that what the neural net sees and makes a decision on is what the user also sees and interacts with in the browser or user interface. Changes in script within the text, such as from Latin to Cyrillic letters, should be detected and handled appropriately.
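As a rough illustration of that advice, the Python sketch below drops control and format characters and flags words that mix scripts. It is a sketch under stated assumptions, not the researchers’ recommended implementation: the function names are ours, and because Python’s standard library has no script property, the script is guessed from the first word of each character’s Unicode name, which is a heuristic.

```python
import unicodedata

def sanitize(text):
    """Drop control (Cc) and format (Cf) characters, keeping newlines and tabs."""
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) not in {"Cc", "Cf"}
    )

def scripts_in_word(word):
    """Heuristic: take the first word of each letter's Unicode name as its script."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])  # e.g. LATIN, CYRILLIC
    return scripts

def mixed_script_words(text):
    """Return whitespace-separated words that contain letters from more than one script."""
    return [w for w in text.split() if len(scripts_in_word(w)) > 1]

print(sanitize("account \u202e1234"))
print(mixed_script_words("send to p\u0430ypal now"))
```

Stripping every format character is blunt: some, such as zero-width joiners, are needed to write certain languages and emoji correctly, so a production filter would likely need an allow-list. Note too that dropping a backspace leaves the junk letter in place, so the model still does not see what a renderer applying the backspace would show; that case is better flagged or rejected than silently cleaned.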
Given that models potentially susceptible to these attacks may already be widely used in production, we may see successful exploitation in the real world. ®