Publications

The CaLiGraph Ontology as a Challenge for OWL Reasoners

Abstract: CaLiGraph is a large-scale cross-domain knowledge graph generated from Wikipedia by exploiting the category system, list pages, and other list structures in Wikipedia, containing more than 15 million typed entities and around 10 million relation assertions. Unlike knowledge graphs such as DBpedia and YAGO, whose ontologies are comparatively simple, CaLiGraph also has a rich ontology, comprising more than 200,000 class restrictions. These two properties, a large A-box and a rich ontology, make it an interesting challenge for benchmarking reasoners. In this paper, we show that a reasoning task which is particularly relevant for CaLiGraph, i.e., the materialization of owl:hasValue constraints into assertions between individuals and between individuals and literals, is insufficiently supported by available reasoning systems. We provide differently sized benchmark subsets of CaLiGraph, which can be used for performance analysis of reasoning systems.

Information Extraction from Co-Occurring Similar Entities

Abstract: Knowledge about entities and their interrelations is a crucial factor of success for tasks like question answering or text summarization. Publicly available knowledge graphs like Wikidata or DBpedia are, however, far from being complete. In this paper, we explore how information extracted from similar entities that co-occur in structures like tables or lists can help to increase the coverage of such knowledge graphs. In contrast to existing approaches, we do not focus on relationships within a listing (e.g., between two entities in a table row) but on the relationship between a listing's subject entities and the context of the listing. To that end, we propose a descriptive rule mining approach that uses distant supervision to derive rules for these relationships based on a listing's context. Extracted from a suitable data corpus, the rules can be used to extend a knowledge graph with novel entities and assertions. In our experiments we demonstrate that the approach is able to extract up to 3M novel entities and 30M additional assertions from listings in Wikipedia. We find that the extracted information is of high quality and thus suitable to extend Wikipedia-based knowledge graphs like DBpedia, YAGO, and CaLiGraph. For the case of DBpedia, this would result in an increase of covered entities by roughly 50%.

Entity Extraction from Wikipedia List Pages

Abstract: When it comes to factual knowledge about a wide range of domains, Wikipedia is often the prime source of information on the web. DBpedia and YAGO, as large cross-domain knowledge graphs, encode a subset of that knowledge by creating an entity for each page in Wikipedia, and connecting them through edges. It is well known, however, that Wikipedia-based knowledge graphs are far from complete. In particular, because Wikipedia's policies permit pages about subjects only if they have a certain popularity, such graphs tend to lack information about less well-known entities. Information about these entities is often available in the encyclopedia, but not represented as an individual page. In this paper, we present a two-phased approach for the extraction of entities from Wikipedia's list pages, which have proven to serve as a valuable source of information. In the first phase, we build a large taxonomy from categories and list pages with DBpedia as a backbone. With distant supervision, we extract training data for the identification of new entities in list pages that we use in the second phase to train a classification model. With this approach we extract over 700k new entities and extend DBpedia with 7.5M new type statements and 3.8M new facts of high precision.

Uncovering the Semantics of Wikipedia Categories

Abstract: The Wikipedia category graph serves as the taxonomic backbone for large-scale knowledge graphs like YAGO or Probase, and has been used extensively for tasks like entity disambiguation or semantic similarity estimation. Wikipedia's categories are a rich source of taxonomic as well as non-taxonomic information. The category German science fiction writers, for example, encodes the type of its resources (Writer), as well as their nationality (German) and genre (Science Fiction). Several approaches in the literature make use of parts of this encoded information without exploiting its full potential. In this paper, we introduce an approach for the discovery of category axioms that uses information from the category network, category instances, and their lexicalisations. With DBpedia as background knowledge, we discover 703k axioms covering 502k of Wikipedia's categories and populate the DBpedia knowledge graph with an additional 4.4M relation assertions and 3.3M type assertions at more than 87% and 90% precision, respectively.

Code

The complete code for the extraction of CaLiGraph is available on GitHub.

Data

The complete dataset is hosted on Zenodo. All files are bzip2-compressed (.bz2) and in N-Triples format. The data is published under the Creative Commons Attribution 4.0 International Public License.
The complete dataset is also available on the DBpedia Databus. Additionally, a version of DBpedia enriched with CaLiGraph is provided as a collection.
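
Because the files are compressed N-Triples, they can be streamed line by line without unpacking them to disk first. The following is a minimal sketch using only the Python standard library; the function names `parse_line` and `iter_triples` are made up for this illustration, and the regular expression is a simplification of the N-Triples grammar (it does not unescape literals or validate IRIs), so a full RDF parser may be preferable for production use.

```python
import bz2
import re
from typing import Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Simplified N-Triples pattern (an assumption of this sketch, not the full grammar):
# subject is an IRI or blank node, predicate is an IRI, and the object is kept
# as its raw string (IRI, blank node, or literal including datatype/language tag).
_TRIPLE = re.compile(r'^(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def parse_line(line: str) -> Optional[Triple]:
    """Parse one N-Triples line; return None for blank lines, comments, or non-matches."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _TRIPLE.match(line)
    return match.groups() if match else None


def iter_triples(path: str) -> Iterator[Triple]:
    """Stream (subject, predicate, object) tuples from a bzip2-compressed N-Triples file."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            triple = parse_line(line)
            if triple is not None:
                yield triple
```

For example, iterating over `iter_triples("caligraph-instances_labels.nt.bz2")` would yield one tuple per label statement while keeping memory usage low, since only one line is held at a time.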

caligraph-metadata.nt.bz2

Metadata about the dataset, described using the VoID vocabulary.

caligraph-ontology.nt.bz2

Class definitions, property definitions, restrictions, and labels of the CaLiGraph ontology.

caligraph-ontology_dbpedia-mapping.nt.bz2

Mapping of classes and properties to the DBpedia ontology.

caligraph-ontology_provenance.nt.bz2

Provenance information about classes (i.e., which Wikipedia category or list page has been used to create this class).

caligraph-instances_types.nt.bz2

Definition of instances and (non-transitive) types.

caligraph-instances_transitive-types.nt.bz2

Transitive types for instances (can also be induced by a reasoner).
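
To illustrate what "induced by a reasoner" means here, the sketch below derives transitive types by following the subclass hierarchy upwards from each instance's direct types. It is not the CaLiGraph extraction code: the function name, the input structures, and the `ex:` identifiers are hypothetical, and it assumes the hierarchy has already been loaded as (subclass, superclass) pairs. The result includes the direct types as well; whether the file above repeats them is not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def transitive_types(
    direct_types: Dict[str, Set[str]],
    subclass_of: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Return, per instance, its direct types plus all of their superclasses."""
    # Index the hierarchy: class -> set of direct superclasses.
    parents = defaultdict(set)
    for child, parent in subclass_of:
        parents[child].add(parent)

    result = {}
    for instance, types in direct_types.items():
        seen = set()
        stack = list(types)
        # Depth-first walk up the hierarchy; `seen` also guards against cycles.
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue
            seen.add(cls)
            stack.extend(parents.get(cls, ()))
        result[instance] = seen
    return result


# Hypothetical toy data, not taken from the dataset.
example = transitive_types(
    {"ex:some_writer": {"ex:Science_fiction_writer"}},
    [("ex:Science_fiction_writer", "ex:Writer"), ("ex:Writer", "ex:Person")],
)
```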

caligraph-instances_labels.nt.bz2

Labels for instances.

caligraph-instances_relations.nt.bz2

Relations between instances derived from the class restrictions of the ontology (can also be induced by a reasoner).
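
The derivation corresponds to the materialization of owl:hasValue constraints mentioned in the first abstract above: if a class is restricted so that a property has a fixed value, every instance of that class receives the corresponding assertion. The following is a simplified sketch of that inference step over plain Python dictionaries. The function name, the representation of restrictions as (property, value) pairs per class, and all `ex:` identifiers are assumptions made for illustration; they are not the encoding used in the ontology file.

```python
from typing import Dict, Iterable, Set, Tuple

Assertion = Tuple[str, str, str]


def materialize_has_value(
    instance_types: Dict[str, Set[str]],
    restrictions: Dict[str, Iterable[Tuple[str, str]]],
) -> Set[Assertion]:
    """Turn hasValue-style class restrictions into (instance, property, value) assertions.

    `instance_types` should already contain transitive types, so that
    restrictions attached to superclasses are applied as well.
    """
    assertions = set()
    for instance, classes in instance_types.items():
        for cls in classes:
            # Each restriction says: members of `cls` have `value` for `prop`.
            for prop, value in restrictions.get(cls, ()):
                assertions.add((instance, prop, value))
    return assertions


# Hypothetical toy data modelled on the "German science fiction writers" example.
derived = materialize_has_value(
    {"ex:some_writer": {"ex:German_science_fiction_writers"}},
    {
        "ex:German_science_fiction_writers": [
            ("ex:nationality", "ex:Germany"),
            ("ex:genre", "ex:Science_fiction"),
        ]
    },
)
```

The value of a restriction may be an individual or a literal; the sketch treats both as opaque strings.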

caligraph-instances_dbpedia-mapping.nt.bz2

Mapping of instances to respective DBpedia instances.

caligraph-instances_provenance.nt.bz2

Provenance information about instances (e.g., whether the instance has been extracted from a Wikipedia list page).

dbpedia_caligraph-instances.nt.bz2

Additional instances of CaLiGraph that are not in DBpedia.
This file is not part of CaLiGraph but should rather be used as an extension to DBpedia.

dbpedia_caligraph-types.nt.bz2

Additional types of CaLiGraph that are not in DBpedia.
This file is not part of CaLiGraph but should rather be used as an extension to DBpedia.

dbpedia_caligraph-relations.nt.bz2

Additional relations of CaLiGraph that are not in DBpedia.
This file is not part of CaLiGraph but should rather be used as an extension to DBpedia.