Wikipedia, Wikidata, and Wikibase: Usage Scenarios for Literary Studies

Frank Fischer, Freie Universität Berlin; Bart Soethaert, Freie Universität Berlin
Freie Universität Berlin
Took place in person, 10.10.2023 – 11.10.2023
Report by Jonah Lubin / Frank Fischer, Freie Universität Berlin

One wet morning in October, Digital Humanities (DH) scholars came to the Freie Universität Berlin to discuss the potential of Wikipedia, Wikidata, and Wikibase for literary research and analysis. The conference came on the heels of a special issue of the “Journal of Cultural Analytics” entitled “Wikipedia, Wikidata, and World Literature”, published in May 2023. After opening remarks from FRANK FISCHER and BART SOETHAERT (Berlin), who organized the conference, the first day featured presentations on the participants’ ongoing research, while the second day was devoted to live demonstrations of Digital Humanities projects as well as a general conversation on the status of Linked Open Data (LOD) in Digital Humanities.

The first presentation was given by JACOB BLAKESLEY (Rome). Frustrated by the lack of concrete data usable for measuring literary influence, Blakesley has begun to use the presence or absence of particular authors and their page views in different Wikipedia versions in order to critique the concept of a single, world-literary canon. In his presentation, Blakesley constructed hypothetical canons of world literature based on the compositions of different Wikipedia language versions, in order to demonstrate the diverse assortment and popularity of writers in different traditions. He focused primarily on three authors generally considered to be of world-literary importance (Dante, Shakespeare, and Joyce), and found that their representation varies significantly among different Wikipedia versions. Dante, for example, is only present in 9 of the 48 African language Wikipedia editions. Changing his focus to the works of a single author, Blakesley found that Shakespeare’s plays differ in terms of view-count across Wikipedia versions, with “Romeo and Juliet” being the most viewed in 50 versions (e.g. English and Arabic), “Hamlet” in 37 (e.g. Greek and Korean), and “Macbeth” in 7 (e.g. Malayalam and Igbo). Turning to Italian literature, Blakesley found that the most-viewed Italian authors differ between the English and Italian versions of Wikipedia: Montale is present in the top-10 in Italian Wikipedia, and absent in the top-10 of the English Wikipedia, whereas Marinetti and Pavese have higher profiles in the English Wikipedia than in the Italian. Future researchers might be interested to investigate the role diglossia plays in the composition of Wikipedia editions. That is to say, is Dante’s absence in the Tajik Wikipedia due to his presence in the Russian Wikipedia? Is there a sort of division of labor between international and national, hegemonic and non-hegemonic canons in diglossic linguistic systems?
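View counts of this kind can be retrieved from the public Wikimedia Pageviews REST API. The following is a minimal sketch of such a request (not Blakesley’s actual pipeline; the article title and date range are illustrative):

```python
# Sketch: building a Wikimedia Pageviews REST API URL for per-article
# view counts in a given Wikipedia language edition.

def pageviews_url(lang, article, start, end):
    """URL for monthly per-article view counts (user traffic only)."""
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    # path segments: project / access / agent / article / granularity / start / end
    return f"{base}/{lang}.wikipedia/all-access/user/{article}/monthly/{start}/{end}"

# Illustrative example: views of the "Romeo and Juliet" article
# in the English Wikipedia during 2023.
url = pageviews_url("en", "Romeo_and_Juliet", "20230101", "20231001")
```

Fetching this URL (e.g. with `requests.get`) returns a JSON list of monthly view counts, which can then be compared across language editions.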

BASTIAN BUNZECK and SINA ZARRIEß (Bielefeld) study literary characters beyond their service to the plot, in a reception-oriented approach to world literature. Wikipedia is well suited for this purpose because of its international reach and its use of independent pages for individual characters. To collect their data, Bunzeck and Zarrieß constructed a bipartite network: the first set of nodes represents individual characters (extracted from Wikidata entries typed ‘literary character’ [Q3658341]), while the second set represents the Wikipedia language editions (e.g. English, Yiddish, Hausa) in which those characters have a page of their own. Bunzeck and Zarrieß claim that the characters with pages in the most Wikipedia language editions are also the most “autonomous,” that is to say, the ones that most often appear outside the context of individual works. The three most “autonomous” characters are Sherlock Holmes, Superman, and Santa Claus, none of whom is particularly associated with a single famous work. Interestingly, there is a disproportionate number of characters from classical literature in the top-30, including Arjuna, Maitreya, and Gilgamesh, all of whose divinity helps to explain their autonomy. Bunzeck and Zarrieß were not only interested in which characters are most represented in Wikipedias, and therefore most autonomous; they also wanted to learn how these characters are portrayed in Wikipedia and how they relate to one another. To study this, they used Wikipedia2Vec, a tool for obtaining embeddings from Wikipedia. They found that characters tend to cluster together (and away from works, for example), and are usually close to their authors as well. They also found that characters exhibit the same gender-related biases as common nouns. For example, Sherlock Holmes has a positive bias towards “career,” while Pippi Longstocking has a positive bias towards “family.”
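The bipartite structure described above can be sketched as follows. The sample edges are invented for illustration (the real node sets come from Wikidata sitelinks), and degree in the character partition serves as the rough proxy for “autonomy”:

```python
# Sketch: character → set of Wikipedia language editions with a dedicated
# page. Edges are invented sample data, not the study's actual network.
from collections import defaultdict

edges = [
    ("Sherlock Holmes", "en"), ("Sherlock Holmes", "de"), ("Sherlock Holmes", "yi"),
    ("Superman", "en"), ("Superman", "de"),
    ("Emma Bovary", "fr"),
]

editions_per_character = defaultdict(set)
for character, edition in edges:
    editions_per_character[character].add(edition)

# Degree in the character partition = number of editions covering the
# character, used here as a rough proxy for "autonomy".
ranking = sorted(editions_per_character,
                 key=lambda c: len(editions_per_character[c]),
                 reverse=True)
```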

FUDIE ZHAO (Oxford), who studies East Asian History and Digital Humanities, presented on the uses of Wikidata in Digital Humanities as described in her recent paper “A systematic review of Wikidata in Digital Humanities projects.” Her research was guided by four questions. The first question was: How is Wikidata described in contemporary DH literature? She found that Wikidata mainly serves as a “content provider,” whose linked data can be used for a variety of purposes; it is also used as a platform to disseminate research, and as a technology stack for publishing linked data. Her second question was: To what end is Wikidata being experimented with in the DH domain? She found that many DH projects use Wikidata as an external LOD resource to enrich their own materials (e.g. corpora, texts); others use it to curate metadata from authority datasets, to model knowledge in the context of the Semantic Web, or to support NER-related tasks. Her third question was: What are the potentials of incorporating Wikidata into DH projects? In 90% of the studies she examined, the content of Wikidata was used as a data source for the project; in 10%, Wikidata served as a platform for data publication and exchange. Her fourth question was: What challenges do DH projects associate with Wikidata? She found that technical problems are rarely reported in papers, but when they are, they typically concern issues such as identifier mismatches, alongside concerns about data quality stemming from Wikidata’s open editing model. The following discussion focused on ways the DH community can address these weaknesses. Proposed solutions included learning from the standardization practices of institutions such as libraries, and greater collaboration among DH scholars to enrich and interlink Wikidata content.

Library and information scientists JUAN ANTONIO PASTOR SÁNCHEZ and TOMÁS SAORÍN (Murcia) set out to see what the literary canon would look like according to Wikipedia and Wikidata. They stressed the outmodedness of the concept of the canon, describing the creation of a literary canon as a game that is usually taken too seriously. To extract a canon of the most central literary works in the Wikiverse, they first compiled a list of all Wikidata items typed ‘literary work’ (Q7725634). Eschewing social metrics such as page views, they instead determined, for each Wikidata item, the number of its properties, the number of Wikipedia articles associated with it, and the total word count of those articles. They combined these three dimensions into a metric (which they term Wiki3DRank) to produce a ranking of the most significant literary works on Wikipedia as a whole, then used K-means++ clustering to separate the ranked works into three groups: literary canon, essential books, and literary or bibliographic production. Their top-20 canonical books, beginning with “Genesis” and ending with “Crime and Punishment,” are generally unsurprising, although the inclusion of “Harry Potter” and “The Hobbit” indicates a pop-cultural bias often detected when working with the Wikiverse. Their ranking also produces unorthodox canons of national literatures, as in the case of Italian, where Pinocchio is present and Dante is absent. Their metric ranks highly certain works that would usually not be considered of literary-canonical value, such as “Mein Kampf” and the “Guinness Book of World Records.” Generally, they found that their metric ranks works as they are received by cultures other than the one that produced them – a view from the outside, especially as determined by the anglophone world. Potential issues with their methods include the inexactitudes of Wikidata classifications (e.g. “The Odyssey,” which at the time was classified not as a ‘literary work’ but as an ‘epic poem’) and the changeable number of properties for Wikidata items.
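A three-dimensional score of this kind can be sketched as follows. The exact weighting of Wiki3DRank was not spelled out, so this sketch assumes equal-weight min-max normalization of the three dimensions; the sample rows are invented:

```python
# Sketch of a Wiki3DRank-style composite score (equal-weight min-max
# normalisation is an assumption; the data below is illustrative).

works = {
    # title: (property count, sitelink/article count, total word count)
    "Genesis": (120, 150, 900_000),
    "Crime and Punishment": (90, 130, 600_000),
    "Obscure Novella": (15, 3, 4_000),
}

def normalise(values):
    """Min-max scale a sequence to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

cols = list(zip(*works.values()))                 # one tuple per dimension
norm = list(zip(*(normalise(c) for c in cols)))   # back to one tuple per work
scores = {title: sum(t) / 3 for title, t in zip(works, norm)}
ranking = sorted(scores, key=scores.get, reverse=True)
```

On the full ranked list, K-means++ clustering over such scores would then yield the three groups named above.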

In his presentation, MARCO ANTONIO STRANISCI (Torino) discussed his attempts to study and ameliorate the underrepresentation of non-Western writers on Wikidata and Wikipedia. Recently, he has been working on the project “World Literature Knowledge Graph” (WL-KG), which models the lives and works of authors. He found that if he simply scraped from Wikidata all people with the occupations ‘writer’ (Q36180), ‘poet’ (Q49757), or ‘novelist’ (Q6625963), transnational authors were highly underrepresented. He defines transnational authors as those who belong to ethnic minorities in the West, and those born in former colonies, from 1808 in the case of Latin America and from 1917 in the cases of Africa and Asia. To close this gap, he began to work on a pipeline that turns raw-text biographies of transnational authors (mostly gathered from Goodreads and Open Library) into structured knowledge. This pipeline consists of five steps: (1) coreference resolution of the entity-target of the biography, (2) event detection, (3) mapping to the Wikibase data scheme, (4) named entity recognition of the triple’s object, (5) entity linking. With this process, he can detect a relevant sentence, e.g. “He served as the Herald’s editor during the 1951–52 school year.”, and then parse and link the relevant information contained therein: He → ‘Chinua Achebe’ (Q155845); editor → ‘employer’ (P108); Herald → ‘The Herald’ (Q7739400). These methods have significantly increased the number of transnational writers in his knowledge graph. Discussion focused on the non-canonical potential, and relatively canonical actuality, of Wikipedia and Wikidata.
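A Wikidata query in this spirit (not necessarily Stranisci’s exact one) retrieves all people holding any of the three occupations via the ‘occupation’ property (P106):

```python
# Sketch: SPARQL query for people with occupation writer, poet, or novelist,
# to be sent to the Wikidata Query Service endpoint.
occupations = ["Q36180", "Q49757", "Q6625963"]  # writer, poet, novelist

query = """
SELECT DISTINCT ?person ?personLabel WHERE {
  VALUES ?occ { %s }
  ?person wdt:P106 ?occ .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""" % " ".join(f"wd:{q}" for q in occupations)
```

The query string can then be posted to https://query.wikidata.org/sparql, e.g. with the `requests` library, asking for JSON results.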

ALAN ANG and LUCY PATTERSON (Berlin), both from Wikimedia Deutschland, came to discuss the current and future status of Wikidata. They are particularly invested in promoting Linked Open Data. In order to do so, they have developed these five pillars for Wikidata: (1) empower the community to increase data quality, (2) facilitate equity in decision-making, (3) increase re-use for increased impact, (4) strengthen underrepresented languages, (5) enable Wikimedia projects to share their workload. Recently, they have been working on such issues as the legibility of EntitySchemas, finding mismatches, tweaking their UI, and improving Wikidata analytics. They also presented Wikibase, an open source platform for managing and sharing structured data. Wikibase is the technology behind Wikidata, and can be used in order to create project-specific knowledge bases. There was also some discussion of Wikibase Cloud, a service, currently in open beta, which offers free cloud instances of Wikibase.

MARIA HINZMANN and TINGHUI DUAN (Trier) discussed their project MiMoText, short for Mining and Modeling Text. The goal of the project is to lay the groundwork for an information network for the humanities, fed from various sources and made available to the public as Linked Open Data – a sort of Wikidata for literary history. To begin, they constructed a knowledge base called MiMoTextBase, primarily focused on the French Enlightenment novel (1750–1800), using the aforementioned Wikibase software as a platform. Compared to Wikidata, MiMoTextBase offers better coverage, higher density, and more explicit data modeling for this subject and period. The data for their knowledge graph was mined from three sources: bibliographic data, primary literature (205 novels in full-text TEI), and scholarly publications. The resulting data was structured in an ontology composed of 11 modules: theme, space, narrative form, literary work, author, mapping, referencing, versioning & publication, terminology, bibliography, and scholarly work. They have made their data available via a SPARQL endpoint and have connected their knowledge base with Wikidata in both directions, by placing links to Wikidata in MiMoTextBase and by creating the MiMoText ID as a new property on Wikidata. This allows for federated queries that use the information and connections of both knowledge bases.
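The shape of such a federated query can be sketched as follows: it runs against the project’s own endpoint and pulls additional statements from Wikidata via a SERVICE clause. The property and prefix names below are illustrative placeholders, not MiMoTextBase’s actual vocabulary:

```python
# Sketch: a SPARQL query federating a local Wikibase with Wikidata.
# `mmp:P12` is a hypothetical local property linking a work to its
# Wikidata item; P577 is Wikidata's real 'publication date' property.
federated_query = """
SELECT ?work ?wikidataItem ?published WHERE {
  ?work mmp:P12 ?wikidataItem .            # placeholder: link to Wikidata item
  SERVICE <https://query.wikidata.org/sparql> {
    ?wikidataItem wdt:P577 ?published .    # fetched remotely from Wikidata
  }
}
"""
```

Because the link exists in both directions (Wikidata links in MiMoTextBase, a MiMoText ID property on Wikidata), the same pattern also works the other way round, federating from the Wikidata Query Service into MiMoTextBase.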

OLAF SIMONS (Halle-Wittenberg) presented another use of Wikibase: FactGrid, “a database for historians.” The project, initiated at the Gotha Research Centre in cooperation with Wikimedia Deutschland, is community-driven, with all editors required to use their real names and to state their particular research interests explicitly. The presentation focused on two investigations of prose fiction: Simons’s own project, which originated in his 2001 dissertation on the English and German book markets between 1710 and 1720, and a project by MARIE GUNREBEN (Konstanz) on German prose fiction published between 1680 and 1750. The basic notion behind these projects is that the first modern literary historians of the 18th and 19th centuries omitted these decades in their attempts to understand literature as a historical phenomenon.

After the presentations on the first day, the second day focused on practical demonstrations of digital methods. VIKTOR ILLMER (Berlin) began with a walkthrough of a Jupyter notebook meant to be an easy, low-code way of querying, parsing, and displaying data from the Wikipedia API, enabling less technically inclined humanities scholars to use this data as part of their arguments. The notebook supports three selection methods: all the language editions of a particular article (e.g. ‘Thomas Pynchon’), all the articles in a category (e.g. ‘Chinese women novelists’), or all articles listed in a user-supplied TSV file. The notebook then retrieves data for the selected pages – including length, number of contributors, number of revisions, and number of page views – and stores it in a Polars DataFrame, providing options for plotting with plotly (bar and regression plots). The discussion began with many recommendations for expanding the notebook, such as changing the time frame for page views, examining page views by country of origin, and adding NLP functions. Unfortunately, multiple API connections are time-consuming, so querying a large number of Wikipedia articles will inevitably take a long time.
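One of the calls such a notebook wraps is the MediaWiki API’s `langlinks` property, which lists the language editions of an article. A minimal sketch of the request parameters (the notebook itself may batch and paginate differently):

```python
# Sketch: parameters for a MediaWiki API query listing interlanguage links
# of one article. Send to e.g. https://en.wikipedia.org/w/api.php.

def langlinks_params(title):
    """Build the query-string parameters for a langlinks lookup."""
    return {
        "action": "query",
        "format": "json",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",   # as many langlinks per request as allowed
    }

params = langlinks_params("Thomas Pynchon")
# e.g. requests.get("https://en.wikipedia.org/w/api.php", params=params)
```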

MARCO ANTONIO STRANISCI (Torino) followed with a demo of the user interface for his World Literature Knowledge Graph, which he had presented the day before. He began by expressing his gratitude to those at the University of Bari who designed a visualization for his database that emphasizes the discovery of authors unknown to the user. The visualization is powered by SPARQL queries and operates on Semantic Web principles. The user begins by searching for an entity such as an author, place, or work. Once the desired entity is found (e.g. ‘Sholem Asch’), the user drags it into the middle of the page, where it is visualized as a node. Clicking on that node displays its attributes on the right-hand side (years lived, works, subjects, citizenship, etc., in the case of an author). These attributes can then be dragged into the center of the screen, where their relation to other entities is graphically displayed. When many authors, places, works, and attributes are placed in the center, a complex web of relations becomes visible. The source for the data in the visualization (often Goodreads) is also cited. Discussion focused on the difficulties of fitting unstructured data from sources like Goodreads into FRBR categories. Future plans for the project include a feature to recommend transnational authors to the user.

In the next demo, SINA ZARRIEß (Bielefeld) demonstrated Wikipedia2Vec, a tool used to obtain word embeddings from Wikipedia, some findings of which had been presented during the talk she and Bastian Bunzeck gave the previous day. Briefly, word embeddings represent words as vectors in a continuous space. Wikipedia2Vec, which builds on Word2vec, derives these vectors from the contexts in which words appear. As such, if the vectors of two words are close to each other in vector space, those words usually appear in similar contexts, at least in the texts used to train the embeddings. An interesting feature of Wikipedia2Vec is that it distinguishes between words, the standard currency of word embeddings, and “entities,” which are Wikipedia pages. As mentioned in her previous presentation, Zarrieß is particularly interested in understanding bias in NLP, particularly in Large Language Models (LLMs). Bias is present in word embeddings as well, and projects like WEFE, a framework for bias measurement and mitigation in word embeddings, are trying to address this. Qualitative inspections of Wikipedia2Vec embeddings suggest a great deal of gender bias: female literary characters are more associated with the household and emotional sensitivity, whereas male literary characters are more associated with careers and strength. These biases in word embeddings, although unfortunate in some contexts, can be useful for Literary Studies as a way to understand the biases of a text, an author, or a corpus of works.
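The association test behind such bias findings can be illustrated with a toy example. The 3-dimensional “embeddings” below are invented (real Wikipedia2Vec vectors have hundreds of dimensions); the measure is the difference in cosine similarity between a character entity and two attribute words:

```python
# Toy illustration of a cosine-based bias measure; vectors are invented.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

vecs = {
    "sherlock_holmes": (0.9, 0.1, 0.2),
    "career":          (0.8, 0.2, 0.1),
    "family":          (0.1, 0.9, 0.3),
}

# Positive value = the character sits closer to "career" than to "family".
bias = (cosine(vecs["sherlock_holmes"], vecs["career"])
        - cosine(vecs["sherlock_holmes"], vecs["family"]))
```

Frameworks like WEFE generalize this idea to whole sets of target and attribute words, following metrics such as WEAT.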

For the final presentation of the day, MARIA HINZMANN (Trier) provided a hands-on demonstration of MiMoText, which she had introduced the day before. Here, she wanted to give a quick introduction to SPARQL and show what is special about the project’s data modeling. She demonstrated SPARQL queries which provide information on publication dates over time, topic modeling, and bibliographic data. When asked why she and her team had not developed user interfaces to simplify these and other tasks, she responded that many visualization tools had already been built in the Wikidata Query Service, which can do things like produce tables, maps, bubble charts, etc., and she did not think that building similar tools of her own was an efficient use of resources for the project. As for data modeling, she emphasized the database’s particular usefulness for Literary Studies, with attributes such as ‘theme’ built into entries, and expressed her hope that more people will contribute to the joint endeavor of creating an ontology for literary history.

The workshop concluded with a general discussion on the use of the Wikiverse for literary scholarship. Topics included the sociology of Wikipedia, the need for a cross-project ontology for literary history, Wikidata as a Linked Open Data hub, and data sustainability.

It seems there is a general desire among Digital Humanists to link their projects. The question is: on whose terms? Hopefully, a collaborative effort will materialize in the coming years to make Digital Humanities projects more interoperable, even if it does not result in a standard ontology for Digital Literary Studies. In the meantime, as these presentations demonstrated, there is a great deal of interesting research in Literary Studies being done with the Wikiverse, and there is a great deal more to be done.

Conference overview:

Jacob Blakesley (Rome): Measuring Literary Popularity Using Wikipedia

Bastian Bunzeck (Bielefeld) / Sina Zarrieß (Bielefeld): Where and How Do Literary Characters Figure in Wikipedia?

Fudie Zhao (Oxford): Wikidata Usage Scenarios for Literary Studies: Insights from ‘A Systematic Review of Wikidata in Digital Humanities’

Juan Antonio Pastor-Sánchez (Murcia) / Tomás Saorín (Murcia): Measuring the Literary Field of Creative Works: The Case of Literary Works According to Wikipedia and Wikidata

Marco Antonio Stranisci (Torino): Deriving World Literatures from Wikidata and Other Communities of Readers

Alan Ang (Wikimedia Deutschland) / Lucy Patterson (Wikimedia Deutschland): Status and Future Plans of Wikidata

Maria Hinzmann (Trier) / Tinghui Duan (Trier): Linked Open Data for Literary History: Constructing, Querying and Using the MiMoTextBase

Marie Gunreben (Konstanz) / Olaf Simons (Martin-Luther-Universität Halle-Wittenberg): Prose Fiction and Dubious Histories of the 17th and 18th Centuries on FactGrid

Viktor Illmer (Berlin) / Frank Fischer (Berlin): Querying the Wikipedia API to Explore the Positioning of Authors and Works: A Jupyter Notebook Walkthrough

Marco Stranisci (Torino): Demonstration of World Literature Knowledge Graph (WL-KG)

Sina Zarrieß (Bielefeld): Wikipedia2Vec Demonstration

Maria Hinzmann (Trier): MiMoText Tutorial