Official website for Linux User & Developer
Nov 18

Wikidata – Wikipedia’s Game-changer

by Karl Beecher

Wikidata is one of the biggest technical overhauls of Wikipedia in its history and the ripples of change will reach far beyond its own shores. Dr Karl Beecher investigates…

Media coverage of problems with Wikipedia is usually quite political. Questions over Wikipedia’s editorial control, bias in certain sensitive articles, wholesale copying of content by lazy students; all these innuendos seem to accompany the very mention of the world’s most famous online encyclopaedia.

But whatever your opinions of Wikipedia politics, it’s hard to deny the technical achievements of the project. Masses of information in dozens of different languages, organised into cross-referenced articles, complemented with multiple media and all available freely at the click of a mouse. It should come as no surprise then that Wikipedia struggles with problems of a technical kind too. One particularly big headache comes from the structure (or lack thereof) of the information that populates its pages. The minimal existing structure tends to make the maintenance of all those gigabytes of data very laborious indeed. But Wikidata, a recently launched initiative being worked on by the German Wikimedia chapter, aims to ease this burden. Its tagline is no less than ‘A Game-changer for Wikipedia and Beyond’.

Where would we be without Wikipedia? It’s hard to believe that only ten years ago we actually had to know things or have a very large bookshelf. There was no logging onto Wikipedia in the dark days of the 1990s when a bit of knowledge escaped you. To recall the name of John Lennon’s first wife you had to be a real Beatles fan. For the capital of Romania you had to go and dust off the atlas. And when you did your homework… well, you get the picture.

Items as they will appear under Wikidata

Just a few years after its birth in 2001, Wikipedia has become one of the most famous and frequently visited sites on the web. There’s little doubt that it now serves as one of the main stops on many of our individual little quests for knowledge, being the sixth most visited site in the world and by far the most popular repository of general information. It can actually be difficult to avoid when searching the internet, because a Wikipedia article will so often be among the first results you get back. Hardly surprising though. Wikipedia currently hosts over 22 million articles on every conceivable topic in 285 different languages and provides free, universal access to them all. And this freedom isn’t just limited to reading: one of the site’s most famous policies allows anyone to edit any of the content (aside from a few ‘locked’ articles). With only a computer and access to the internet, you too could become a contributor to an encyclopaedia. This all makes Wikipedia a poster child of the web’s collaborative nature.

Indeed, the very openness that makes some people marvel is troublesome to others. These sceptics criticise the potential for inaccuracies and plagiarism, as well as the generally poor quality of some content. Wikipedia has responded by implementing quality controls – mainly in the form of layers of administrative users with extra editorial powers – and these come in for almost as much criticism as the problems they’re trying to solve. Those at the opposite end of the critical spectrum decry the excessive amounts of control, highlighting the rise of inequality between contributors, the emergence of power structures, and the stifling effects of rules and processes.

Yet whatever your opinion of its organisation, there’s little room for debate over the need for a large body of overseers when you consider how enormous the challenge of maintaining Wikipedia is. Regardless of the structure of its people, Wikipedia’s databases have vast amounts of complex information requiring continuous attention and maintenance. This raises a refreshingly technical series of problems, instead of the usual political wrangling that dominates the discussion. There’s no shortage of technical problems to choose from, but a new effort by Wikimedia Deutschland – the German chapter of the charitable foundation behind Wikipedia and several other collaborative Wiki-based sites – is attacking a sweet spot in the complexity of Wikipedia. The effort is called Wikidata.

What is Wikidata?

A few difficulties stand out in the day-to-day running of Wikipedia. For Wikipedians who pride themselves on their project’s international reach, it’s an especially jarring problem that only four languages dominate the content. The English, German, French and Dutch Wikipedias (each language version is referred to as its own Wikipedia) each have over a million articles, but most others have far fewer, leaving many people around the world unable to read the content and join in the encyclopaedic fun. Another big difficulty is the amount of stuff that has to be done by hand. That’s not really surprising, given that it’s composed mainly of prose articles, the production of which is still far beyond automation (thankfully, for professional writers). This also means that the actual content is not easily machine readable, therefore limiting the potentially clever uses of all that lovely data.

Yet data will be key to the solution Wikidata provides, as Lydia Pintscher makes clear. Bubbly, friendly and outgoing, Lydia fits her role as Wikidata’s head of communications to a tee. It’s her job to facilitate the community’s interactions and publicise the project.

“Basically, this is Wikidata,” she explains, producing a handy leaflet. On the leaflet is a table that any regular Wikipedia user will recognise. It’s the familiar sidebar that many articles use to contain summary data. “We call this an item, it’s basically a class. At the moment, this kind of information is unstructured and just plain text.” Indeed, each language version of an article has to have its own translated and localised copy of summary data like this. That’s a problem when you remember that most languages on Wikipedia struggle to get enough contributors who can produce an appreciable number of articles. “Instead of that, the idea is to structure the information in these items and put it all in a central database that every language version of Wikipedia can use.”

The consequences promise to be impressive. For starters, a lot of labour will be saved by having only a single copy of the data. For example, the UK recently conducted a census and is releasing the results in stages. If the revised population figure is released before Wikidata is finished, then the article on the UK will need updating separately in each Wikipedia. After Wikidata’s rollout, that figure will be stored only once in a database and instead linked to by every article. Therefore a single update will propagate throughout all Wikipedias. This way, all language versions will be able to tap into this central database, so that they can at least build up the foundations of missing articles easily. And it won’t be just other Wikipedias which can query the database. Anyone in the world will be able to access Wikidata, a treasure trove of information that will surely interest any online service wanting to provide you with answers (see ‘The Female Mayors Problem’ below for more on that). Make of it what you will that Google is one of Wikidata’s main backers, helping to fund the €1.3 million costs.
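To make the 'store once, link everywhere' idea concrete, here is a minimal Python sketch. It is not Wikidata's actual software: the `central_store` dictionary, the `templates` table and both function names are invented for illustration, and the population values are placeholders rather than real census figures.

```python
# Hypothetical central store: one shared record per item.
# The population value is a placeholder, not a real census figure.
central_store = {
    "United Kingdom": {"population": 1_000_000},
}

# Each language edition keeps only a localised template,
# not its own copy of the figure. (Assumed, simplified layout.)
templates = {
    "en": "Population: {population}",
    "de": "Einwohner: {population}",
    "fr": "Population : {population}",
}


def render_infobox(item, language):
    """Fill one language's template from the single shared record."""
    return templates[language].format(**central_store[item])


def update_population(item, new_value):
    """A single edit to the central store, seen by every edition on its next render."""
    central_store[item]["population"] = new_value


if __name__ == "__main__":
    update_population("United Kingdom", 1_050_000)
    for language in templates:
        print(language, "->", render_infobox("United Kingdom", language))
```

The point of the design is that `update_population` is called once, however many language editions exist; none of them holds a copy that could fall out of date.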

The Female Mayors Problem

How Wikidata is a search engine’s dream

Not only will the data in Wikidata be centralised, it will also be structured and typed. The information will then make possible the semantic search of huge swathes of human knowledge, a very exciting prospect for any search engine. Instead of simply matching a search query against plain text, a search engine can actually know the semantic meaning of information and process it accordingly. The classic scenario used to explain it is the Female Mayors Problem. The question ‘How many mayors in the world are female?’ becomes easily answerable because each mayor in the world will have their own item in a central database, which includes typed information such as their age, time in office and their gender. Gaining the answer to the Female Mayors Problem then becomes nothing more difficult than querying a database. What’s more, the information is localisable, so each item can be provided in a range of languages, formats and measurements.
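As a rough illustration of why typed data turns the Female Mayors Problem into a simple query, consider the Python sketch below. The `Mayor` record, its fields and the three sample entries are all made up for the example; they do not reflect Wikidata's real data model or any real office holders.

```python
from dataclasses import dataclass


@dataclass
class Mayor:
    """Hypothetical typed item; the field names are assumptions."""
    name: str
    city: str
    gender: str
    years_in_office: int


# Made-up sample data standing in for the central database.
mayors = [
    Mayor("Mayor A", "City 1", "female", 3),
    Mayor("Mayor B", "City 2", "male", 7),
    Mayor("Mayor C", "City 3", "female", 1),
]


def count_by_gender(items, gender):
    """Count items whose typed 'gender' field equals the requested value."""
    return sum(1 for mayor in items if mayor.gender == gender)


if __name__ == "__main__":
    print(count_by_gender(mayors, "female"))
```

Because `gender` is a field with a known meaning rather than a word buried in prose, the count is a filter over records, not a text search.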

The Scrum-style task board on the wall of Wikidata's main office

Making it happen

What’s going on behind the scenes? Again, the best person to ask is Lydia Pintscher. She’s one of a 13-member core team based in Wikimedia Deutschland’s Berlin headquarters. Aside from a couple of project managers, the rest of her colleagues are programmers.

As Lydia shows me around, the environment is very much like that of a software development project. Desks are laid out in a large open space, at which people quietly code away on their workstations. On the wall, a huge organisational chart betrays the team’s Scrum style of development. Task descriptions are pinned to a long whiteboard by magnets. Each magnet has a picture with the face of the developer who owns the corresponding task. The whiteboard is itself divided into three sections: ‘To Do’, ‘Doing’ and ‘Done’. Is this all just a glorified programming project then? Well, no, not when you view the project in its entirety. There’s still much to be done and, as of August 2012, Wikidata is just finishing up its initial phase, the first of three.

“The first phase is probably only interesting to people inside Wikipedia,” admits Lydia, explaining how phase one is going to tidy up the inter-language links. “In the source code of every article there are links which point to all other versions of the article which are in other Wikipedias. That’s not so clever. There’s a huge amount of effort maintaining all these links. So phase one is putting all of these links into a central database, so there’s no duplication.”
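The phase one idea can be sketched in a few lines of Python. This is only an illustration under assumed names: the `sitelinks` table, the `item-1` identifier and the two functions are hypothetical and are not Wikidata's real schema or API.

```python
# Hypothetical central table: one entry per topic, mapping a
# language code to that language's article title.
sitelinks = {
    "item-1": {"en": "Romania", "de": "Rumänien", "fr": "Roumanie"},
}


def other_language_links(item_id, current_language):
    """Links an article would display: every language except its own."""
    return {
        language: title
        for language, title in sitelinks[item_id].items()
        if language != current_language
    }


def add_language(item_id, language, title):
    """Register a new language version once; all other versions pick it up."""
    sitelinks[item_id][language] = title


if __name__ == "__main__":
    add_language("item-1", "ro", "România")
    print(other_language_links("item-1", "en"))
```

Without the central table, adding a new language version means editing the link list inside every existing version of the article. With it, one call to `add_language` is enough.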

The team have just recently completed phase one right on schedule. Now things start to get interesting for the wider world, because in phase two the infoboxes seen earlier will finally become available. Editors will be able to add and update the data, as well as include the information in Wikipedia’s articles. This work will take them up to the end of 2012.

In the final phase, the Wikidata team are going to focus on automatic list creation. There are already an endless number of lists littered throughout Wikipedia, each of which gathers together links to related articles, such as the ‘List of Presidents of the United States’. Some lists are long, some lists are short. Some are rather obscure (such as the ‘List of fictional ducks’ article). Some are even lists of lists. Regardless of their nature, they all have one thing in common: they are static. Every list is assembled and maintained by hand. Although it’s yet to be planned in detail, phase three will allow the dynamic creation of lists by using the structured information in Wikidata. Look out for it at phase three’s scheduled completion date in March 2013.
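Since phase three had yet to be planned in detail at the time of writing, the following Python sketch is speculative. It shows only the general principle of building a list from structured data on demand; the `items` records, the `kind` field and `build_list` are invented for the example.

```python
# Hypothetical structured items; 'kind' is an assumed field name.
items = [
    {"label": "Donald Duck", "kind": "fictional duck"},
    {"label": "Bucharest", "kind": "city"},
    {"label": "Daffy Duck", "kind": "fictional duck"},
]


def build_list(records, kind):
    """Generate an alphabetical list of labels for every record of one kind."""
    return sorted(record["label"] for record in records if record["kind"] == kind)


if __name__ == "__main__":
    print(build_list(items, "fictional duck"))
    # A newly added item appears the next time the list is generated,
    # with no hand editing of the list itself.
    items.append({"label": "Scrooge McDuck", "kind": "fictional duck"})
    print(build_list(items, "fictional duck"))
```

A static list has to be edited whenever a relevant article appears. A generated one is simply rebuilt from whatever the database currently holds.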

The future

The completion of phase three doesn’t necessarily have to be the end of it. Lydia is most enthusiastic about the prospect of continuing Wikidata beyond its scheduled end date, but the limiting factor for the non-profit Wikimedia Foundation is, of course, money. For big efforts like this, Wikimedia needs to employ at least a core of paid people. The current sponsors – Google, the Allen Institute for Artificial Intelligence, and the Gordon and Betty Moore Foundation – have together stumped up €1.3 million for the scheduled work, but further funding is being sought.

“We’re just laying the groundwork,” says Lydia. “There’s still much more opportunity after that.” Such as? “We submitted an idea to the Knight Foundation which proposed that Wikidata become the central resource for global identifiers on the web. For example, news agencies talk about a lot of similar things but find it really hard to exchange information because they have no common vocabulary. Instead, each agency could refer to a uniquely identifiable item stored in Wikidata.” Unfortunately in this case, the Knight Foundation decided against funding the idea, but, as Lydia says emphatically, there are so many different directions this work could be taken in future.

Certainly, Wikidata will not come to an end for the want of new ideas.
