Wikipedia-linked Dictionary of Madagascar’s National Language Full of Bot-Generated Errors

Jimmy Wales, co-founder of Wikipedia
GIUSEPPE CACACE/Getty

An audit of the Wikipedia-affiliated dictionary Wiktionary for the Malagasy language, the national language of Madagascar, found the site to be littered with errors. The errors were attributed to bots mostly programmed by the site’s sole active administrator, which mass-produced almost all six million entries included on the Malagasy Wiktionary. Noting the problems were too comprehensive to be resolved manually, the audit recommended deleting most entries on the site.

The audit of the Malagasy Wiktionary was initiated as problems recently revealed at the Scots language Wikipedia brought attention to low-activity sites affiliated with Wikipedia. Scots Wikipedia was found to have included many articles not written in genuine Scots due to having been written by an American teen unfamiliar with the language.

Wiktionary is a dictionary site using the same open-editing wiki format as Wikipedia. Both are owned by the Wikimedia Foundation. As with Wikipedia, there are Wiktionary sites for numerous known languages, including those with few native speakers. Malagasy, the native language of Madagascar, is primarily spoken by the African island country’s 25 million inhabitants. Despite the language’s relatively small number of native speakers, the Malagasy Wiktionary currently has the second-most entries of all Wiktionary sites, behind only the English-language version, and the third-most overall edits. At the same time, the site has only 20 active users.

Most of those pages and edits were made by several bots, particularly those programmed by “Jagwar,” who identifies himself as a native speaker of Malagasy. In a bio on the Malagasy Wikipedia, Jagwar states he was educated at a French-speaking school in Madagascar and has lived in France since age 10. He claimed his writing proficiency in his native tongue was “catastrophic” in his early youth, but had since improved. Jagwar is also one of the Malagasy Wiktionary’s two administrators, users charged with overseeing sites who can suspend other users and delete pages, and is the only administrator active on the site.

In their audit of the Malagasy Wiktionary, administrators from the English Wiktionary noted Jagwar’s near-monopoly on the site’s content arising from his use of bot accounts. Combined with another user’s bot that stopped operating in 2017, bots have produced over 6 million entries. Jagwar is thus noted as responsible for nearly all entries on the Malagasy Wiktionary. Relying on the assistance of an unnamed fluent speaker, the audit highlights numerous repeated errors introduced by bots. Examining a random sample of a hundred entries on the site, only one of which was edited by a human, the audit found nearly half were unusable and false, with fewer than a quarter being correct and usable.

Of the errors identified, the most serious involved entries providing Malagasy definitions for words in other languages using machine translations of other Wiktionaries, according to the audit. In one case, a term from the Bantu language Lingala meaning “to make different” was defined in Malagasy as “vomit” instead. More than a quarter of the hundred sample cases were erroneous, but not completely flawed, such as an Azeri term meaning “relative” or “kinsmen” including definitions in Malagasy for “friend” and “ally” as well. Some terms were technically accurate yet omitted vital context, such as an English-language racial slur being defined using just the Malagasy term for that race.

Problems created by these bots included what the audit calls “bizarre bot errors” where apparent bits of code are posted and linked in the definitions alongside the Malagasy words. Tens of thousands of entries were identified as having such errors. Other errors are attributed by the audit to bots using dictionary definitions, including one of Jagwar’s bots creating a page for “singing traditional sakalava accompany the drum” and identifying the nonsensical string of terms as a French word. Further problems arose from bots creating pages based on other Wiktionary sites, as entries later deleted from those sites for being erroneous remained on the Malagasy Wiktionary.

One major problem identified was a lack of definitions for certain words, with the audit attributing most cases to copyright removals. In thousands of cases, entries are noted as effectively lacking definitions due to errors, such as defining a Malagasy word by just repeating the same word. A particularly unusual case involved the entry for the Malagasy term “tamboho” containing two sections for definitions of the word and those definitions simply being the word “tamboho” repeated twice. Erroneous translations were another issue cited. For instance, the entry for the Malagasy term for “the” included “Internet” and “orchestra” or “banana” for Afrikaans and Romanian translations, respectively.

Citing these problems, the audit recommended taking two sweeping actions to address the errors introduced by bot-creation of pages. The first recommendation was to delete all entries from the Malagasy Wiktionary that provide Malagasy definitions for non-Malagasy terms and that were created by the identified bots. The second was to delete all Malagasy entries created by those bots that lacked definitions. Jagwar’s bot was also noted as having made edits over several years at the Cherokee-language and Kurdish-language Wiktionaries, and the audit suggested those contributions be reviewed. Most entries on these other Wiktionary sites were only edited by bots and contained errors similar to those on the Malagasy Wiktionary.

Polling of users from various Wikipedia-affiliated sites responding to the audit showed near-unanimous support for implementing both recommendations, with only one user suggesting they should prioritize “growing the Malagasy Wiktionary community” instead. Numerous users objected to this suggestion by noting the difficulty of getting contributors who would be interested in manually reviewing and cleaning up millions of entries on a site, particularly given very few fluent speakers would be able to contribute to what users argued would likely be a years-long undertaking.

Jagwar responded to the audit by insisting he acted in good faith and described critics as merely annoyed that a Wiktionary site contains millions of bot-created entries with what he called “a significant percentage” of low-quality content, placing “low-quality content” in scare quotes. Past blog posts by Jagwar were similarly dismissive of widespread errors. One claimed what he portrayed as a modest “5% error rate,” while another acknowledged there could be “thousands of pages of potentially wrong information” yet suggested it also meant more accurate entries. One user countered that Jagwar’s actions “have been an utter waste of time and do a disservice to Malagasy speakers and the Wikipedia movement.”

Editors behind the audit launched it after the Scots Wikipedia controversy in August. In that case, an American teenager who acknowledged not being fluent in the language became one of the main contributors to Wikipedia’s Scots-language variant. As in the Malagasy Wiktionary case, widespread errors that veered into the comical were identified, with the recommended action being mass-deletion and removal of the user’s contributions. Such problems have plagued Wikipedia sites and affiliated sites that receive less widespread participation. Wikidata, a widely touted site affiliated with Wikipedia, has been prone to hosting long-standing vandalism. This included the entry for First Lady Melania Trump labeling her a “porn star” for a week.

Bot editing and automated editing are controversial on Wikipedia and affiliated sites as well. Numerous bot operators on Wikipedia have been sanctioned or banned by Wikipedia’s Arbitration Committee, likened to a Supreme Court, due to problems created by their bots and automated editing tools. In spite of these problems, the Wikimedia Foundation has committed to using automation to identify “harassment” on its sites, including Wikipedia, and positively cited this commitment in a statement endorsing the Black Lives Matter movement. One algorithm the Foundation co-developed with Google to identify “toxicity” was shut down last year after it was found to not identify anti-Semitic remarks as toxic and to automatically treat comments towards women as more toxic than comments towards men.

Media outlets, academia, and Big Tech, in relying on Wikipedia and its low-activity affiliated sites, have ignored these recurring problems and instead generally treat Wikipedia and related sites as reliable sources of information. On several occasions this has led to them spreading false information and hoaxes.

T. D. Adler edited Wikipedia as The Devil’s Advocate. He was banned after privately reporting conflict of interest editing by one of the site’s administrators. Due to previous witch-hunts led by mainstream Wikipedians against their critics, Adler writes under an alias.
