Sunday, June 26, 2016

#Wikidata supports over 300 languages - implications

Wikidata is a single project that supports over 300 languages. The aim is that all data is usable in any and all of them. One important consequence is that for each item 300 translations of the label are needed.

Obviously, fewer items is more. Each item has to have a purpose that is clear, obvious and cannot be expressed in another way.

For this reason I am opposed to the addition of all kinds of subclasses that add no value. There is no point to "APA Award". It is an award that is conferred by the American Psychiatric Association (APA) and it can be easily described in two statements.

Such a combined item makes it extra hard to add translations. It is relevant to know that it is an award. It is important to know who conferred it, but there is no point in having this expressed in a combined item. It does not fit with the work that is done on awards. It makes Wikidata less usable and consequently such items need to be deleted.
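To make the "two statements" concrete, here is a minimal sketch in Python. It assumes that "instance of" is property P31, that "conferred by" is property P1027 and that the generic "award" item is Q618779; please verify these before relying on them. The id for the conferring organisation is a made-up placeholder, not a real item.

```python
# Assumed Wikidata identifiers (verify before use):
INSTANCE_OF = "P31"      # property "instance of"
CONFERRED_BY = "P1027"   # property "conferred by"
AWARD = "Q618779"        # item "award"

# Hypothetical placeholder, NOT a real item id for the organisation.
APA_PLACEHOLDER = "Q_APA_PLACEHOLDER"


def describe_award(conferring_body):
    """Return the two statements that describe an award given by one organisation."""
    return [
        (INSTANCE_OF, AWARD),             # it is an award
        (CONFERRED_BY, conferring_body),  # and this is who confers it
    ]


statements = describe_award(APA_PLACEHOLDER)
for prop, value in statements:
    print(prop, value)
```

Both statements reuse items that already need labels in every language, so no new subclass has to be translated.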

Saturday, June 25, 2016

#Wikidata has a CC-0 license. This should not change II

Wikidata is becoming a repository where people may choose to share their data ... or not. When they do not want to share their data with Wikidata, it is their choice and that is fine.

The bottom line of copyright for databases is that single facts cannot be copyrighted. It is only the whole of a database that can be under copyright. When you look at the data of Wikidata and its structure, it is in many ways a reflection of all the Wikipedias. Increasingly the data of the Wikipedias finds its way into Wikidata and, as a consequence, data that may also be found in a specialised database gets included in Wikidata.

Wikidata also has the habit of including identifiers to external sources on an item level. As a consequence people can see what other sources have to say about the same subject. It also enables bots to make a comparison. When a bot writes a report about the differences, it is original research and consequently it does not violate any copyright. When people make changes based on such a report, it takes an effort to find what is correct and consequently it does not violate copyright either. When an agreement is in place, it is possible to add missing data to Wikidata. When done properly there will be an attribution of the original source and, when it is done by a bot, it may be a bot dedicated to that resource.
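The comparison a bot makes can be sketched in a few lines of Python. This is only an illustration under assumptions: the identifiers, field names and values are invented, and a real bot would fetch both sides from the live services instead of using hard-coded dictionaries.

```python
def diff_report(wikidata, external):
    """Compare records that share an external identifier.

    Returns (identifier, field, wikidata_value, external_value) tuples,
    one for every field where the two sources disagree. A value that is
    missing on one side is reported as None.
    """
    report = []
    # Only identifiers known to both sources can be compared.
    for identifier in sorted(wikidata.keys() & external.keys()):
        ours, theirs = wikidata[identifier], external[identifier]
        for field in sorted(ours.keys() | theirs.keys()):
            if ours.get(field) != theirs.get(field):
                report.append((identifier, field, ours.get(field), theirs.get(field)))
    return report


# Invented sample data, keyed by a hypothetical external identifier.
wikidata = {"id-001": {"date of birth": "1944", "citizenship": "United Kingdom"}}
external = {"id-001": {"date of birth": "1945", "citizenship": "United Kingdom"}}

for line in diff_report(wikidata, external):
    print(line)
```

The report lists only where the sources differ; deciding which value is correct remains the work of a person.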

The objective of the Wikimedia Foundation is to share data. This is why it makes so much sense for Wikidata to have a CC-0 license. As the quality improves, as more and more comparisons are made and the differences are reconciled, the data becomes more valuable. Given its scope, not much is out of scope and it is obvious that Wikidata needs to include data from other sources wholesale. It may get information in so many ways. With the CC-0 license the message is obvious: use our data, compare our data, improve our data and this will bring more power to us all.

Wednesday, June 22, 2016

#Wikidata - Home Children

As I was adding more information to a female psychologist, Mrs Margaret Humphreys, I found that she documented Home Children, a British program of sending destitute children abroad. Sending them away was cheaper than leaving them with their families on welfare.

It became such a scandal that prime ministers of several countries that were involved apologised for the awful way people were treated.

Originally it was considered a solution to the slavery in the British match-making industry. What started with good intentions became something dreadful.

The problem: which statements to use to identify this program, the people who apologised for it and the original good intentions?

Tuesday, June 21, 2016

#Wikidata - the Lange-Taylor Prize

Wikidata knows about many awards. It is a challenge to make the information available, but it is even harder to keep it up to date. An example is the Lange-Taylor Prize.

Have a look at the English Wikipedia article; strictly speaking it is not a list. It is a mishmash. To make this a "proper" Wikidata list, it helps when the "point in time" is added to the award winners. It helps when the winners are complete. Michel Huneault is the 2015 winner; he or she has to be added as an item to Wikidata, and is not the only award winner who has neither a red link nor an item.

Adding the point in time as a qualifier has an additional relevance. It becomes possible to build a query that shows which awards have no winner for 2015. When a winner is missing, and this happens a lot both in Wikipedia and Wikidata, we can check the website for the award and maybe find a 2016 winner as well.
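How the "point in time" qualifier helps can be sketched as follows. The SPARQL text assumes the usual Wikidata Query Service prefixes, with P166 for "award received" and P585 for "point in time". The function names, the placeholder award id and the sample years are my own illustration and not real data about any prize.

```python
# SPARQL template: winners of one award, with the year taken from the
# "point in time" (P585) qualifier on the "award received" (P166) statement.
QUERY_TEMPLATE = """
SELECT ?winner ?winnerLabel ?year WHERE {
  ?winner p:P166 ?statement .
  ?statement ps:P166 wd:%s .
  OPTIONAL { ?statement pq:P585 ?date . BIND(YEAR(?date) AS ?year) }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?year
"""


def build_winner_query(award_qid):
    """Fill in the item id of the award (hypothetical helper)."""
    return QUERY_TEMPLATE % award_qid


def missing_years(years_with_winner, first_year, last_year):
    """Return the years in the range for which no winner is recorded."""
    known = set(years_with_winner)
    return [year for year in range(first_year, last_year + 1) if year not in known]


# Placeholder id, NOT the real item of any award.
print(build_winner_query("Q_AWARD_PLACEHOLDER"))

# Invented years as they might come back from the query above.
print(missing_years([2013, 2014, 2016], 2013, 2016))
```

Winners without the qualifier still show up in the result, only without a year, which is exactly what makes them easy to spot and fix.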

As it is, the English article is a stub. There are missing links, for instance to Mrs Katherine Dunn. Adding all this info to Wikidata improves the quality of its data, and it also makes it possible to incorporate this list on both the English and the Czech Wikipedia.

Monday, June 20, 2016

#Wikidata - #Pakistan Peoples Party politicians

There has been an announcement that lists may be generated on a Wikipedia using Wikidata data. For the Urdu Wikipedia, a list of politicians of the Pakistan Peoples Party could be interesting. This functionality is not available yet, but Reasonator does show us what would be on such a list.

As you can see, many of them do not yet have an article in Urdu or, alternatively, a label. Once a label has been added, it will show up in the list. This may also help other languages from Pakistan like Sindhi, because the label will fall back to Urdu and not English.
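The fallback idea can be sketched like this. The order of the chain (Sindhi, then Urdu, then English) is an assumption based on what is described above, not a statement of how the software is configured, and the labels are placeholders.

```python
def pick_label(labels, fallback_chain):
    """Return the first available label, trying the languages in order."""
    for language in fallback_chain:
        label = labels.get(language)
        if label:
            return label
    return None  # no label in any language of the chain


# Assumed fallback order for a Sindhi reader: Sindhi, Urdu, English.
SINDHI_CHAIN = ["sd", "ur", "en"]

# Placeholder labels for one politician: no Sindhi label exists yet.
labels = {"ur": "<Urdu label>", "en": "<English label>"}

print(pick_label(labels, SINDHI_CHAIN))
```

With an Urdu label present the Sindhi reader sees the Urdu text; without it the list would show English.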

Saturday, June 18, 2016

#Wikidata has a CC-0 license. This should not change.

The Wikipedia Signpost is a publication of the English Wikipedia. It published a piece about copyright and Wikidata and it suggested that a more restrictive license would be fine. Their problem: others benefit and do not need to acknowledge Wikidata as a source.

For me the most important thing about our work is that it is used. Everything we can do to make our data used more increases the value of our data. This is best achieved by refusing to put any restrictions on our data.

One argument for another license is that "it recognises the labour that goes into maintaining the data". The question is how to recognise this and why. Every data point has its own history, both for the property and for the data, and as a consequence it is the database that you refer to for the attribution. For human consumption it is the label that gives Wikidata much of its relevance; giving credit to the people who add labels is just as relevant.

Data is mostly generated in an automated or semi-automated way. I would not have over 2 million edits if all statements I added had to be done by hand. With StrepHit, a tool that retrieves facts from authorised sources, data gathering will become even more sophisticated, reliable and complete. The link to personal glory in attribution becomes very much absent.

Wikidata will become increasingly rich in references, and tools like StrepHit will ensure the quality of such references. Wikidata is already very rich in references to other sources of data, and this is why Wikidata will evolve into a resource for comparison with the data in these sources. These other sources may opt to adopt our data or to report differences, and we have the same options. Comparisons allow us to research the issues that exist with the data we hold, and these comparisons will become highly automated and intelligent.

My point is very much that Wikidata is not a glory project. Our data is incomplete and immature and in several ways more ambitious than what a Wikipedia aims to do. Wikidata can include the ambitions of a Wikipedia up to a point. To realise its own ambitions, becoming a valuable and valued resource in the web of data, it is important to be as open and available as we can be. A license that does not restrict is one of the underpinnings. Moving towards a more restricted license will only create a morass of uncertainty and doubt. It will bring us no benefit.

Thursday, June 16, 2016

#Wikidata - Mark Fiore won the 2016 #Herblock prize

The Herblock prize is just one award I added data to. I grabbed the data from the Wikipedia article and used "Linked Items" to import the winners. I checked the website of the award and noticed that there is a new winner.

I added Mr Fiore as the 2016 Herblock prize winner.

I have done this before but something is changing. At Wikidata they are investigating how lists with Wikidata data may be used in a Wikipedia. Now that makes all the work that I have done relevant because I have concentrated on such lists and categories.

When this works out well, it takes one edit to include new data in every Wikipedia that has an interest in certain data. As Wikidata is finally evolving in this direction, some things are now relevant, like which label to show when a label in the language of the Wikipedia is missing; hopefully any label will be shown. Another new feature is that changes from Wikidata may be shown in the history.

The next thing to consider is that when Wikidata knows that somebody studied at a university, it automatically shows in an associated category. Technically it is not hard; selling it to the Wikipedia crowd may be.