Knowledge Graphs

Paul Ellis
Published in Analytics Vidhya · Feb 3, 2021


The term ‘Knowledge Graph’ in its modern usage can be traced back to the Google Knowledge Graph, announced in 2012 with the stated aim of making search about ‘things, not strings’ [1]. It should come as no surprise that the success of the Google Knowledge Graph coincided with the increase in knowledge bases. Over recent years knowledge graphs have become a popular way to represent data and sit at the intersection of knowledge discovery, data mining, the Semantic Web and Natural Language Processing (NLP).

In essence a knowledge graph is structured data in graph form, used to represent the relationships between its data points. The criteria for a knowledge graph are abstractly described by Kejriwal as:

‘a graph-theoretic representation of human knowledge such that it can be ingested with semantics by a machine. In other words, it is a way to express ‘knowledge’ using graphs, in a way that a machine would be able to conduct reasoning and inference over this graph to answer queries (‘questions’) in some meaningful way.’ Kejriwal (2019).

An illustration of a knowledge graph can be found in Fig 1, which shows the panel rendered by Google for the search query ‘springsteen’. The panel itself is powered by knowledge graph-centric technologies; related knowledge bases include Wikidata and Freebase.

Fig. 1 Google knowledge panel

From a functional perspective, however, a knowledge graph is derived from a set of triples, each of which represents a fact. In its most rudimentary form a triple is a 3-tuple (h, r, t), where h represents the head entity, t represents the tail entity and r expresses the relationship between the two entities h and t. A simple example will help to explain the concept of triples.

‘Fido the dog stole a bone from Mary’s garden’

The triples associated with this sentence can be expressed as:

{(Fido, is-a, Dog),

(Fido, stole, bone),

(bone, is-a, Bone),

(bone, located-in, garden),

(garden, is-a, Garden),

(garden, belongs-to, Mary),

(Mary, is-a, Person)}.

As discussed, triples represent an assertion, with elements 1 and 3 representing entities whilst element 2 is normally referred to as the relation (or relationship). Where the tail is a number or string, for example (Mary, DOB, ‘12/07/95’), it is expressed as a literal.
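
To make this concrete, here is a minimal sketch of the Fido example expressed as triples with the rdflib Python library. The http://example.org/ namespace and the property names are illustrative assumptions, not part of any standard vocabulary.

# A minimal sketch of the Fido example as RDF triples using rdflib.
# The example.org namespace and property names are placeholders.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Entity-to-entity triples: (head, relation, tail)
g.add((EX.Fido, RDF.type, EX.Dog))          # (Fido, is-a, Dog)
g.add((EX.Fido, EX.stole, EX.bone))         # (Fido, stole, bone)
g.add((EX.bone, RDF.type, EX.Bone))         # (bone, is-a, Bone)
g.add((EX.bone, EX.locatedIn, EX.garden))   # (bone, located-in, garden)
g.add((EX.garden, RDF.type, EX.Garden))     # (garden, is-a, Garden)
g.add((EX.garden, EX.belongsTo, EX.Mary))   # (garden, belongs-to, Mary)
g.add((EX.Mary, RDF.type, EX.Person))       # (Mary, is-a, Person)

# A triple whose tail is a literal (a string rather than an entity)
g.add((EX.Mary, EX.dateOfBirth, Literal("12/07/95")))

print(g.serialize(format="turtle"))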

In order to construct a knowledge graph, a process of Information Extraction (IE) is undertaken whereby the source text, which can be semi-structured and/or unstructured, is read and translated in conjunction with an ontology, ‘a description of things, relationships, and their characteristics’, Powell (2015). The process of information extraction includes:

Named Entity Recognition (NER), used to identify named entities and categorise them into predefined classes (a short sketch follows this list).

Relation Extraction, used to identify the semantic relationships between entities.

Event Extraction, which determines what happened and when it happened. Recent developments in this field have utilised Generative Adversarial Networks (GANs) [2].
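
As a minimal illustration of the NER step, the sketch below uses spaCy with its small English model, an assumed but commonly used choice; any NER pipeline could be substituted.

# A minimal NER sketch: extract named entities and their predicted
# classes from free text using spaCy's small English model.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Fido the dog stole a bone from Mary's garden last Tuesday.")

for ent in doc.ents:
    # ent.label_ is the predicted class, e.g. PERSON or DATE
    print(ent.text, ent.label_)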

The next step in constructing knowledge graphs is, as the name suggests, Entity Resolution, which entails linking the entities extracted from the data to real-world entities. The difficulty of course, particularly in the case of natural language, lies in the subtleties and ambiguities which result in misinterpretation: ‘Even today, despite rapid advances, machines still cannot read and understand English nearly as well as humans.’ Kejriwal (2019).
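
Entity resolution in practice ranges from simple string similarity to learned matching models. Purely as a deliberately naive sketch, and not the approach of any of the systems cited here, the snippet below resolves noisy surface forms against a small list of hypothetical canonical entities using fuzzy string matching.

# A deliberately naive entity-resolution sketch: map noisy surface forms
# to canonical entity names using fuzzy string matching.
from difflib import get_close_matches

canonical_entities = ["Bruce Springsteen", "The Mandalorian", "Cormac McCarthy"]

for mention in ["springsteen", "the mandalorien", "C. McCarthy"]:
    match = get_close_matches(mention, canonical_entities, n=1, cutoff=0.4)
    print(mention, "->", match[0] if match else "unresolved")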

Resource Description Framework (RDF) and Ontologies

We touched upon ontologies in the previous section: an ontology defines types of things, their relationships to other things and the properties applicable to those things. An ontology can be considered a ‘graph model for a vocabulary’, Powell (2015), and is in part an extension of a taxonomy, which is itself a graph structure used to group hierarchical relationships. ‘In an ontology, the taxonomy is extended so that you express logical relationships among things, membership in groups, multiple inheritance relationships, the symmetry of relationships, exclusiveness, and various characteristics of a given thing.’ Powell (2015).

An ontology utilises first-order logic in order to express statements about truth, similar to the triples we have already discussed. The process of constructing an ontology involves the following steps (a small sketch follows the list):

1. Select unique terms based upon domain information

2. Define each term

3. Organise terms hierarchically
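
Here is a hedged sketch of those three steps applied to the Fido example, again using rdflib; the RDFS vocabulary is used for the hierarchy, and the class names and definitions are illustrative assumptions.

# A tiny ontology sketch: select terms, define them, and organise them
# hierarchically using RDFS class and subclass assertions.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()

# 1. Select unique terms from the domain
for term in (EX.Animal, EX.Dog, EX.Person, EX.Garden):
    g.add((term, RDF.type, RDFS.Class))

# 2. Define each term
g.add((EX.Dog, RDFS.comment, Literal("A domesticated canine.")))

# 3. Organise terms hierarchically
g.add((EX.Dog, RDFS.subClassOf, EX.Animal))

print(g.serialize(format="turtle"))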

The Resource Description Framework (RDF) can be described as ‘a data model for the Web of Data and the Semantic Web providing a logical organization defined in terms of some data structures to support the representation, access, constraints, and relationships of objects of interest in a given application domain.’ Curé (2015). RDF also utilises the triples data model, whose elements correspond to Internationalized Resource Identifiers (IRIs), which are very similar to URIs or URLs but permit Unicode characters. Whereas a URL will resolve to a particular web address, an IRI may not resolve to anything. An example of an IRI from DBpedia, a collection of RDF data whose source is Wikipedia, which identifies a book:

http://dbpedia.org/resource/The_Road

‘The RDF model requires that you be able to relate any subject, predicate or object IRI to a vocabulary. A vocabulary, such as a taxonomy or ontology, identifies what the IRI is and provides information about its relationship to other items in that vocabulary.’ Powell (2015). The process of querying and manipulating RDF on the Semantic Web was outlined by the W3C in the query language SPARQL (the SPARQL Protocol and RDF Query Language). SPARQL 1.0 became a recommendation in 2008 [3] and SPARQL 1.1 was published in 2013 [4]. SPARQL is an RDF query language which retrieves and manipulates data stored in RDF. ‘SPARQL is based on the notion of Basic Graph Patterns which are sets of triple patterns. A triple pattern is an extension of an RDF triple where some of the elements can be variables which are denoted by a question mark.’ Gayo (2018). Additional ontology languages layered on RDF include RDFS and the more expressive OWL 2.
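
As an illustration, the sketch below sends a basic graph pattern with a single triple pattern to the public DBpedia SPARQL endpoint using the SPARQLWrapper Python library; the dbo:author property is an assumption about how DBpedia models this particular book.

# A minimal SPARQL sketch: query the public DBpedia endpoint for the
# author of The Road using one triple pattern with a ?author variable.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?author WHERE {
        <http://dbpedia.org/resource/The_Road> dbo:author ?author .
    }
""")

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["author"]["value"])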

Storage

Underpinning the RDF graph is storage, which can take different forms, whether serialised as XML or held within relational and/or RDF databases, also known as triplestores, which include Ontotext GraphDB, AllegroGraph and Oracle Spatial. Oracle Spatial is based upon the W3C RDF standard: ‘However, while this capability provided some significant graph capabilities, it fell short of the capabilities offered in popular property graph databases such as Neo4J.’ Harrison (2015). That said, RDF graphs, as we have discussed, provide strong support for the development of ontologies and distributed data sources and, as we shall see, a richer data model. Neo4j, by contrast, is a property graph database with its own query language, Cypher, which stores a graph in the mathematical sense: a set of nodes and the relationships holding between those nodes. Like triplestores, property graph databases provide persistence, create/update/delete capability, indexing, path traversal, and search and query capabilities for graphs.
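
For comparison with the RDF sketches above, here is a hedged sketch of the same Fido fact stored as a property graph in Neo4j through its official Python driver; the connection URI and credentials are placeholders for a local instance, and the node labels and relationship types are illustrative.

# A property-graph sketch: store the Fido fact in Neo4j via Cypher.
# The URI and credentials are placeholders for a local instance.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        """
        MERGE (d:Dog {name: 'Fido'})
        MERGE (g:Garden {owner: 'Mary'})
        MERGE (b:Bone {id: 'bone-1'})
        MERGE (d)-[:STOLE]->(b)
        MERGE (b)-[:LOCATED_IN]->(g)
        """
    )

driver.close()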

Ecosystems

A prominent example of a knowledge graph ecosystem is DBpedia, which we have already touched upon during our discussion of IRIs. DBpedia attempts to leverage the structured information in Wikipedia. The data contained on the Wikipedia page for Mandalorian (Fig 2) is extracted and rendered as RDF in DBpedia (Fig 3). ‘DBpedia is available both as RDF dumps, and as a queryable SPARQL endpoint.’ Kejriwal (2019)

Fig. 2 Wikipedia page for Mandalorian
Fig. 3 The DBpedia dashboard for Mandalorian
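
The sketch below loads the RDF description of a single DBpedia resource into rdflib. The /data/...ttl URL pattern is an assumption about how the current DBpedia deployment serves per-resource Turtle; the SPARQL endpoint shown earlier is the more robust route.

# A sketch of loading one DBpedia resource description into rdflib.
# The /data/<resource>.ttl URL pattern is assumed to serve Turtle.
from rdflib import Graph

g = Graph()
g.parse("https://dbpedia.org/data/The_Mandalorian.ttl", format="turtle")

print(len(g), "triples loaded")
for s, p, o in list(g)[:5]:   # peek at the first few triples
    print(s, p, o)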

The popularity of DBpedia coincides with the continuing growth of Wikipedia, as NLP ecosystems leverage the data for weak supervision, producing noisier, lower-quality but larger-scale training sets, with the added advantage that Wikipedia is multilingual.

Although the name Schema.org might suggest just a website, it is actually a shared mark-up vocabulary which can be incorporated into web pages to enable web search engines to transform a returned query into a structured response. Schema.org works in collaboration with Google, Bing and Yahoo to transform responses and thus improve a website’s visibility by showing additional information such as product pictures, star ratings, product prices, dates and even events. A search engine scrapes the mark-up from the site and is thus able to gather further pertinent information. In the example below, the website of the UK clothing retailer Rohan contains schema.org snippets, Fig 4, which enable Bing to extract and display further information including a user rating, Fig 5. However, it should be noted that schema.org mark-up does not attempt to link to other schema.org mark-up, and so in one sense it is only a loose affiliation with knowledge graphs. Schema.org is in the process of enabling a form of Entity Resolution to overcome this gap.

Fig. 4 Embedded schema.org snippets in Rohan’s HTML pages
Fig. 5 A query for Rohan top 15 jackets includes an additional user rating panel.
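
To show what this kind of mark-up looks like in practice, the sketch below builds an illustrative schema.org Product description as JSON-LD; the name, rating and price are placeholders rather than Rohan’s actual mark-up, and the printed payload would normally be embedded in a script tag of type application/ld+json.

# An illustrative schema.org Product payload as JSON-LD. All values are
# placeholders; a retailer would embed the printed JSON in its HTML.
import json

product_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Waterproof Jacket",
    "image": "https://example.com/jacket.jpg",
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.6",
        "reviewCount": "128",
    },
    "offers": {
        "@type": "Offer",
        "price": "149.00",
        "priceCurrency": "GBP",
    },
}

print(json.dumps(product_markup, indent=2))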

Conclusion

We have discussed how knowledge graphs have risen in popularity in line with the increase in knowledge bases, and how data can be represented as singular facts in the form of triples which, in conjunction with Information Extraction (IE), enable us to perform entity resolution against real-world data. This process, which utilises an ontology, an extension of a taxonomy, to express relationships, underpins the Resource Description Framework (RDF), which provides logical structures and permits data to be identified and retrieved through Internationalized Resource Identifiers (IRIs).

Within the knowledge graph ecosystem we drew upon examples that make use of these processes, in particular DBpedia, which utilises Wikipedia to enable the extraction of training sets for NLP. Similarly, schema.org, in conjunction with Google, Bing and Yahoo, enables snippets to be embedded into HTML pages so that search engines can include additional information pertaining to a search query, whether product pictures, star ratings, product prices, dates or even events.

Paul

References

A Librarian’s Guide to Graphs, Data and the Semantic Web, James Powell and Matthew Hopkins (Los Alamos National Laboratory), Chandos Publishing, 2015

Building ontologies with basic formal ontology, Robert Arp, Barry Smith, and Andrew D. Spear, MIT Press 2015

Domain-Specific Knowledge Graph Construction, Mayank Kejriwal, Springer, 2019

Graph-based Knowledge Representation: Computational Foundations of Conceptual Graphs, Michel Chein and Marie-Laure Mugnier, Springer-Verlag London Limited, 2009

Next Generation Databases NoSQL, NewSQL and Big Data, Guy Harrison, Apress, 2015

RDF Database Systems: Triples Storage and SPARQL Query Processing, Olivier Curé and Guillaume Blin, Elsevier, 2015

Validating RDF Data, Jose Emilio Labra Gayo (University of Oviedo), Morgan & Claypool Publishers, Synthesis Lectures on Semantic Web: Theory and Technology #16, 2018

Vannevar Bush’s 1945 essay “As We May Think”, http://web.mit.edu/STS.035/www/PDFs/think.pdf

1. Singhal, A., Introducing the Knowledge Graph: things, not strings, Official Google Blog, 2012. https://www.blog.google/products/search/introducing-knowledge-graph-things-not/

2. https://www.analyticsvidhya.com/blog/2017/06/introductory-generative-adversarial-networks-gans/

3. E. Prud’hommeaux and A. Seaborne. SPARQL Query Language for RDF. W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/

4. S. Harris and A. Seaborne. SPARQL 1.1 Query Language. W3C Recommendation, 2013. http://www.w3.org/TR/sparql11-query/
