Dimi Dimitrov

Data Act: A small step for databases, an even smaller step for the EU

by Dimi Dimitrov
February 3, 2022

A map of the world with areas dense in Wikidata Items shown bright. 2021. Author: Addshore. License: CC-0

Today, the European Commission has leaked its proposal for a “Data Act”, a piece of legislation that is supposed to include a revision of the Database Directive and the sui generis right for the creators of databases (SGR) it establishes.

by Dimi Dimitrov
January 21, 2022

Vote results during plenary session. Image: © European Union, 2022 – Source: European Parliament

Yesterday the European Parliament adopted its negotiating position on the EU’s new content moderation rules, the so-called Digital Services Act. The version of the text prepared by the Committee on Internal Market and Consumer Protection (IMCO) was mostly adopted, but a few amendments were added.

by Dimi Dimitrov
December 16, 2021 (updated December 17, 2021)

Illustrated by Jasmina El Bouamraoui and Karabo Poppy Moletsane, WP20Symbols "Community 1", Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/legalcode)

The EU is working on universal rules on content moderation, the Digital Services Act (DSA). Its co-legislators, the European Parliament (EP) and the Council, have adopted their respective negotiating positions in breakneck time by Brussels standards. Next, they will negotiate a final version with each other.
While the EP’s plenary vote on the DSA is up in January and amendments are still possible, most changes parliamentarians agreed upon will stay. We therefore feel that this is a good moment to look at what both houses are proposing and how it may reflect on community-driven projects like Wikipedia, Wikimedia Commons and Wikidata.

by Dimi Dimitrov
November 8, 2021 (updated November 9, 2021)

If the EU really wants to revamp the online world, it should start shaping legislation with the platform models it likes to support in mind, instead of just going after the ones it dislikes.

Whistleblowers are important. They often provide evidence and usually carry conversations forward. They might be able to open the debate to new audiences. I am grateful to Frances Haugen for having the courage to speak and the energy to do it over and over again across countries, as the discussion is indeed global.

On the other hand, the hearings didn’t reveal anything completely new; we didn’t learn anything we didn’t already know. We live in a time where the peer-to-peer internet has essentially been replaced by a network of platforms, which, in their overwhelming majority, are for-profit, data-collecting and indispensable in everyday life.

by Dimi Dimitrov
October 27, 2021

Mishaps happen. The question is how to deal with them. (Image: "The crashed B-2 at Guam", Federal Aviation Administration, 2008, public domain)

There are many bots on Wikipedia: computer-controlled “user accounts” that perform simple, repetitive, maintenance-related tasks. Most are simple, built to fix typos or to detect vandalism using a list of blacklisted words. ClueBot NG, by contrast, combines several detection methods that use machine learning at their core.
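To make the "list of blacklisted words" approach concrete, here is a minimal sketch in Python. It is not the code of any actual Wikipedia bot: the word list, the function names and the rule of only checking newly added words are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical word list, for illustration only; real bots maintain their own.
BLOCKLIST = {"stupid", "idiot", "sucks"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def added_words(old_text, new_text):
    """Return the words that occur more often in the new revision than in the old one."""
    added = Counter(tokenize(new_text)) - Counter(tokenize(old_text))
    return list(added.elements())


def looks_like_vandalism(old_text, new_text, blocklist=BLOCKLIST):
    """Flag an edit if any newly added word is on the blocklist."""
    return any(word in blocklist for word in added_words(old_text, new_text))
```

A check like this is cheap, but it cannot read context: an editor who legitimately adds the word "idiot" to an article about that word's etymology would be flagged too. That kind of false positive is one reason simple word lists only go so far.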

Bots on Wikipedia

A bot (a common nickname for a software robot) is an automated tool that carries out repetitive and mundane tasks. Bots are used to maintain different Wikimedia projects across language versions. Bots are able to make edits very rapidly, but can disrupt Wikipedia if they are incorrectly designed or operated. False positives are an issue as well. For these reasons, a bot policy has been developed. There are currently 2,534 bot tasks approved for use on the English Wikipedia; however, not all approved tasks involve actively carrying out edits. Bots will leave messages on user talk pages if the action that the bot has carried out is of interest to that editor. There are 323 bots flagged with the “bot” flag right now (and over 400 former bots) on English Wikipedia. On Bulgarian Wikipedia, a much smaller language version, there are currently 106 bot accounts, but only some of them are active. Projects by smaller communities sometimes need to rely more on machines for page maintenance.

by Dimi Dimitrov
September 27, 2021

Graffiti in Vitoria-Gasteiz, Zarateman, Creative Commons Zero, Public Domain Dedication

There is an idea to use a “section recommendation” feature to help editors write articles by suggesting possible sections to be added. But it is possible that its recommendations inadvertently increase gender bias. Here’s how we could deal with it.

by Dimi Dimitrov
September 15, 2021

Al-Jazari's programmable automata (1206 CE): Realistic humanoid automata were built by craftsmen from many civilizations and were believed to be capable of wisdom and emotion. [Public Domain, via Wikimedia Commons]

There is a machine learning service available to interested Wikimedia projects and communities called ORES. It aims to recognise if an edit, for instance on Wikipedia, is damaging or done in good faith. Of course, false predictions cannot be avoided and thus remain a major risk. Here’s how we try to handle it.

by Dimi Dimitrov
August 26, 2021 (updated August 27, 2021)

Jan Davidszoon de Heem: "Vanitas - Still Life with Books and Manuscripts and a Skull", public domain

Just before the summer recess, the European Parliament’s Internal Market and Consumer Protection committee released over 1300 pages of amendments to the EU’s foremost content moderation law. We took the summer to delve into the suggestions and are ready to kick off the new Parliamentary season by sharing some thoughts on them. Our main focus remains on how responsible communities can continue to be in control of online projects like Wikipedia, Wikimedia Commons and Wikidata.

1. The Greens/EFA on “manifestly illegal content”

AM 691 by Alexandra Geese on behalf of the Greens/EFA Group
Article 2 – paragraph 1 – point g a (new)
‘manifestly illegal content’ means any information which has been subject of a specific ruling by a court or administrative authority of a Member State or where it is evident to a layperson, without any substantive analysis, that the content is not in compliance with Union law or the law of a Member State;

Almost any content moderation system will require editors or service providers to assess content and make ad-hoc decisions on whether something is illegal and therefore needs to be removed or not. Of course, things aren’t always black-and-white and sometimes it takes a while to make the right decision, like with leaked images of Putin’s Palace. Other times it is immediately clear that something is an infringement, like a verbatim copy of a hit song, for instance. In order to recognise these differences the DSA rightfully uses the term “manifestly illegal”, but it fails to actually define it. We agree with Alexandra Geese and the Greens/EFA Group that the wording of Recital 47 should make it into the definitions.

by Dimi Dimitrov
July 6, 2021

Manuel de la bibliothèque publique, in public domain

The European Commission wants more European data (public, private and personal) to be shared for the purposes of innovation, research and business. It also wants to avoid a system where only a few large platforms control all the data. It thus wants to create mechanisms and tools to get there. That’s commendable! What the Commission proposes in the Data Governance Act (DGA), though, is at times very unclear.

Here is a breakdown of the European Commission proposals by sector, peppered with our take on some relevant aspects and support for some European Parliament and Council amendments.

Public Sector Data

The DGA creates a mechanism for re-using public sector data that is protected (e.g. by privacy rules, statistical confidentiality or IP). Public sector bodies are to establish secure environments where data can be mined within the institution. Anonymised data could be provided outside of the institution if the re-use can’t happen within its infrastructure.

by Dimi Dimitrov
June 4, 2021 (updated June 21, 2021)

Wikimedia organizational and user rights hierarchy, under CC0 1.0

In the second half of 2020 the Wikimedia Foundation received 380 requests for content alteration and takedown. Two were granted. This is because our communities do an outstanding job in moderating the sites. Something the Digital Services Act negotiators should probably have in mind.

See the organisational chart in full here

Wikipedia is a global top-10 website that anyone can edit and upload content to. Its sister projects host millions of files uploaded by users. Yet, all these projects together triggered only 380 notices. How in the world is this possible?

DSA: Parliament adopts position on EU Content Moderation Rules

The EU’s New Content Moderation Rules & Community Driven Platforms

Editorial: The DSA debate after Haugen and before the trilogues

Meet “ClueBot NG”, an AI Tool to tackle Wikipedia vandalism

Wikimedia Projects & AI: Designing a “Section Recommendation” tool without reinforcing biases

Wikimedia Projects & AI Tools: Vandalism Detection

DSA in IMCO: Three amendments we like and one that surprised us

Data Governance Act: Good Intentions, Bad Definitions

Takedown Notices and Community Content Moderation: Wikimedia’s Latest Transparency Report