Knowledge Graph
After performing named-entity recognition (NER), we come to the last step: visualizing the NER results as a graph.
What is a Knowledge Graph?
A knowledge graph is a structured representation of information. It organizes and connects entities, concepts, and relationships in a graph. The graph is constructed by turning named entities into nodes and the relationships between them into directed edges. By exploring the graph, we may uncover hidden patterns and interesting relationships between entities.
Neo4j Graph Database Software
To facilitate knowledge graph construction, we use Neo4j, a graph database that stores and queries graph data efficiently. Neo4j's query language is called Cypher, which is also the format of the NER output from the previous section. The simplest query for importing data looks like CREATE (John)-[:ROOMMATE_OF]->(Anson), which adds two nodes, John and Anson, and links them with a relationship named ROOMMATE_OF. Neo4j can also visualize the graph, providing a visual representation of the entity relationships.
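One detail worth noting: in Cypher, the bare names inside the parentheses (John, Anson) are query variables rather than stored values, so a node's name is usually kept in a property. The sketch below shows one common way to write the same import so that re-running it does not create duplicates. The Person label is an assumption made purely for illustration; the name property matches the one read by the queries in the Examples part of this section.

```cypher
// Sketch only: the :Person label is a hypothetical choice, not taken from the NER output.
// MERGE creates a node or relationship only if a matching one does not already exist.
MERGE (john:Person {name: 'John'})
MERGE (anson:Person {name: 'Anson'})
MERGE (john)-[:ROOMMATE_OF]->(anson)
```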
Graph Exploration
To discover interesting insights and patterns in the graph, we apply several network science concepts: centrality measures (degree centrality and PageRank) and community detection algorithms such as label propagation.
Centrality measures how “important” a node is within the whole graph. We are interested in nodes with high centrality because they tend to have more connections and contribute more to the overall network. Degree centrality is simply the degree of a node, that is, the number of neighbours it has. The higher a node's degree centrality, the more connected it is. Another centrality measure is PageRank, the algorithm developed by Google's founders, Larry Page and Sergey Brin, and used in the Google search engine. Intuitively, it estimates which “pages” a visitor would land on most often if links were clicked at random.
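For reference, PageRank is commonly written in the form below, following Page and Brin's original formulation. Here T_1, ..., T_n are the nodes that link to A, C(T_i) is the number of outgoing links of T_i, and d is the damping factor, i.e. the probability that the random visitor follows a link instead of jumping to an arbitrary page (0.85 is a typical choice). This is given as background only; the exact normalization used by a particular implementation may differ.

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```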
Apart from centrality measures, community detection (or clustering) is another common technique for graph analysis. The aim is to divide a complex network into several communities (subgraphs) whose nodes are densely connected to one another. This simplifies the analysis and can reveal communities whose members seem unrelated at first glance but are in fact closely connected. Label propagation is one such algorithm: in its standard form, every node starts with its own label and repeatedly adopts the label most common among its neighbours, so that densely connected groups end up sharing a label.
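As a rough sketch of how label propagation could be run in Neo4j, the two statements below use the Graph Data Science (GDS) library. They assume that GDS 2.x is installed, that the in-memory graph is named 'myGraph' (the name used by the queries in the Examples part of this section), and that projecting every node and relationship with '*' is acceptable. None of these settings is confirmed by the pipeline described here.

```cypher
// Assumption: project all nodes and relationships into an in-memory graph called 'myGraph'.
// Run this once, before any gds.*.stream call that refers to 'myGraph'.
CALL gds.graph.project('myGraph', '*', '*');

// Stream the community assigned to each node and list the largest communities first.
CALL gds.labelPropagation.stream('myGraph')
YIELD nodeId, communityId
RETURN communityId,
       collect(gds.util.asNode(nodeId).name) AS members,
       count(*) AS memberCount
ORDER BY memberCount DESC
```

Nodes that share a communityId belong to the same detected community, so inspecting the members lists is one way to spot groups of entities that are worth a closer look.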
Examples
1. Degree Centrality
We first sort all nodes according to their degree centrality in descending order.
CALL gds.degree.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score AS degree
ORDER BY degree DESC, name DESC
The above query yields the following output:
[{"name": "我" (Me), "degree": 1258},
{"name": "本報" (This Newspaper), "degree": 802},
{"name": "祖國週刊" (Motherland Weekly), "degree": 629},
{"name": "香港" (Hong Kong), "degree": 614},
{"name": "中國學生周報" (Chinese Student Weekly), "degree": 472},
{"name": "友聯書報發行公司" (Union Press Circulation Co.), "degree": 464}, ...]
It shows that “Me” is the most important node in the graph according to degree centrality. All nodes directly connected to “Me” are shown below:
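A view like this can be obtained with a simple pattern match. The query below is a minimal sketch that assumes node names are stored in a name property, as in the query above, and that ignores node labels and relationship direction.

```cypher
// Return the node named '我' together with every node one hop away, in either direction.
MATCH (centre {name: '我'})-[r]-(neighbour)
RETURN centre, r, neighbour
```

In Neo4j Browser, returning nodes and relationships in this way renders them as a graph. The same pattern works for any other entity by changing the name.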
2. PageRank
We first sort all nodes according to their PageRank in descending order.
CALL gds.pageRank.stream('myGraph')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score AS score
ORDER BY score DESC, name DESC
The above query yields the following output:
[{"name": "香港" (Hong Kong), "score": 307.08},
{"name": "中國" (China), "score": 147.96},
{"name": "美國" (United States), "score": 100.53},
{"name": "九龍" (Kowloon), "score": 70.22},
{"name": "日本" (Japan), "score": 51.78},
{"name": "臺灣" (Taiwan), "score": 48.3}, ...]
It shows that “Hong Kong” is the most important node in the graph according to PageRank. All nodes directly connected to “Hong Kong” are shown below:
One may also want to know how CUHK is related to other entities. Similarly, we can query and visualize the subgraph as follows: