Using Cypher to work with the Australian National Graph

The Australian National Graph is the first large-scale initiative to develop a national PID graph database. Using the Research Graph schema, it connects more than a thousand research organisations to their associated research outputs.


Introduction

In this article, we look at the graph database developed as part of the National Graph project and how to work with it using Cypher queries.

As of April 24, 2024, the Australian National Graph included data on 259,111 Australian researchers, 27,566 datasets, 1,764,309 publications, 88,234 grants, and 4,406 Australian organisations. The data is accessible to Australian researchers under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. The database uses the Research Graph schema to connect researchers, organisations and associated entities.


PIDs, or Persistent Identifiers, are unique codes that identify digital objects, people, or concepts. Major PID providers include Crossref, ORCID, and DataCite. The National Graph project is a collaborative approach to building a national-level graph of persistent identifiers, which makes it easier to gain insights into collaborations between research institutions, industry, and international partners. The graph was constructed and deployed using Neo4j.

For more information about the Research Graph schema, refer to this post.

Architecture

The National Graph uses a three-layer architecture to create, optimise and distribute the graph. This layered architecture allows for the separation of key components, enhancing efficiency and serviceability.

PID Graph: A data pipeline extracts records from major PID providers such as ORCID, PubMed and Crossref. In the pipeline, the data is cleaned (including the removal of special characters), deduplicated and crosslinked. Deduplication means that, for example, the same researcher appearing in both PubMed and Crossref exists as one entry in the database, not two. Crosslinking means that all relationships of the deduplicated entities are preserved: every publication of that researcher that came from PubMed stays connected to them, along with every publication that came from Crossref. The data is stored in MongoDB, and intermediate CSV files are generated from it. Microservices then use these files to create the Neo4j graph. This layer uses the Research Graph schema to connect the PIDs by applying metadata, text mining, and entity resolution algorithms.

The graph shows the publications connected to the researcher Ashley I. Bush. The red nodes represent the publications connected to ORCID. The green nodes represent the publications connected to PubMed.
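
The deduplicate-then-crosslink idea can be sketched in Cypher. This is a minimal illustration and not the project's actual pipeline: the file names, the CSV columns, the lowercase researcher and publication labels, the key property and the relatedTo relationship type are all assumptions made for this example. MERGE matches an existing node or creates it, so loading the same key twice still yields a single node.

```cypher
// Hypothetical CSV of researchers; the file name and columns are assumptions.
LOAD CSV WITH HEADERS FROM 'file:///researchers.csv' AS row
// MERGE reuses the node if this key already exists (deduplication).
MERGE (r:researcher {key: row.key})
SET r.full_name = row.full_name,
    r.source = row.source;

// Hypothetical CSV of researcher-publication pairs (crosslinking).
LOAD CSV WITH HEADERS FROM 'file:///researcher_publication.csv' AS row
MATCH (r:researcher {key: row.researcher_key})
MATCH (p:publication {key: row.publication_key})
// MERGE creates the link only if it is not already present.
MERGE (r)-[:relatedTo]-(p);
```

Because both statements use MERGE, re-running them with records from a second provider would add only the nodes and links that are not already in the graph.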

Optimisation Layer: This layer converts Australian research output into an optimised graph. It applies entity resolution: for example, the nodes associated with “The Australian National University” and “ANU” are combined because they refer to the same entity. It also creates meaningful links between Datasets, Researchers, Organisations, Grants and Publications, as well as a network of associated organisations.
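
The final step of entity resolution, combining two nodes that refer to the same organisation, could look like the sketch below. It assumes the APOC library is installed, and it assumes an organisation label with a name property; neither is confirmed by this article, and how the project actually detects and merges matching nodes is not described here.

```cypher
// Assumed label (organisation) and property (name); requires the APOC plugin.
MATCH (a:organisation {name: 'The Australian National University'}),
      (b:organisation {name: 'ANU'})
// Merge b into a: keep a's property values and move b's relationships onto a.
CALL apoc.refactor.mergeNodes([a, b], {properties: 'discard', mergeRels: true})
YIELD node
RETURN node;
```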

Access Layer: This layer consists of a distributed network of Neo4j databases, enabling easy access to the National Graph content using the Neo4j Graph interface. This decentralised structure allows Australian researchers to utilise their own cloud infrastructures for managing and coordinating database access efficiently, allowing dynamic allocation of resources based on specific needs. This ensures flexible and scalable data management.

Note: To gain access to the Australian National Graph, you need to be a partner of The Research Graph Foundation. Access is provided on an agreement basis.

Cypher

The Australian National Graph is deployed on Neo4j, and accessing its content requires knowledge of Cypher, Neo4j’s query language for interacting with graph databases.

Cypher is similar to SQL, but optimised for graphs. Its constructs are based on English prose and iconography, which makes queries easy to read and write. Its ASCII-art style syntax also provides a visual way of matching patterns and relationships.

Round brackets represent (:Nodes), and -[:ARROWS]-> represents a relationship between the (:Nodes). With this query syntax, you can perform create, read, update, or delete (CRUD) operations on your graph.

A graph example involving four nodes and three relationships. Source.

This graph could be translated into English as: “Sally likes Graphs. Sally is friends with John. Sally works for Neo4j.”

In Cypher, the same information would look like this:
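
A minimal sketch of those three statements as a single CREATE clause follows. The labels (Person, Technology, Company), the name property and the relationship type names are assumptions chosen for illustration.

```cypher
// Four nodes and three relationships: Sally likes Graphs,
// is friends with John, and works for Neo4j.
CREATE (sally:Person {name: 'Sally'}),
       (john:Person {name: 'John'}),
       (graphs:Technology {name: 'Graphs'}),
       (neo4j:Company {name: 'Neo4j'}),
       (sally)-[:LIKES]->(graphs),
       (sally)-[:IS_FRIENDS_WITH]->(john),
       (sally)-[:WORKS_FOR]->(neo4j);
```

Reading the same data back uses the same pattern shape with MATCH, for example MATCH (p:Person)-[:LIKES]->(t) RETURN p.name, t.name.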

With these basic building blocks in mind, the following Cypher queries can be used to work with the Australian National Graph.

Graph Exploration

Finding the number of grants for each source
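
A query along the following lines groups grant nodes by their source. The lowercase grant label and the source property are assumptions based on the result column names; check them against the schema of your own instance.

```cypher
// Count grant nodes per data source (label and property names assumed).
MATCH (g:grant)
RETURN g.source AS source, count(g) AS no_of_grants
ORDER BY no_of_grants DESC;
```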

| source | no_of_grants |
| --- | --- |
| “arc.gov.au” | 32338 |
| “orcid.org” | 29596 |
| “nhmrc.org” | 28547 |
| “crossref.org” | 49 |

Finding the most common number of years a grant is given
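
One way to compute grant duration is to subtract the start year from the end year and count how often each value occurs per source. The grant label and the start_year and end_year properties are assumptions; toInteger is used in case the years are stored as strings.

```cypher
// Duration in years per grant, then the most frequent durations per source.
// Label and property names are assumed, not confirmed by the article.
MATCH (g:grant)
WHERE g.start_year IS NOT NULL AND g.end_year IS NOT NULL
RETURN g.source AS source,
       toInteger(g.end_year) - toInteger(g.start_year) AS duration_of_years,
       count(g) AS no_of_grants
ORDER BY no_of_grants DESC
LIMIT 10;
```

With this definition, a duration of 0 means the grant starts and ends in the same calendar year.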

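| source | duration_of_years | no_of_grants |
| --- | --- | --- |
| “arc.gov.au” | 3 | 12399 |
| “nhmrc.org” | 2 | 9584 |
| “orcid.org” | 2 | 8694 |
| “arc.gov.au” | 4 | 7527 |
| “orcid.org” | 3 | 6493 |
| “nhmrc.org” | 3 | 4778 |
| “orcid.org” | 1 | 4448 |
| “arc.gov.au” | 2 | 4392 |
| “arc.gov.au” | 5 | 3770 |
| “orcid.org” | 0 | 3655 |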

Finding the most common funder
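
A sketch of a query that counts grants per source and funder is shown here. The grant label and the funder property are assumptions inferred from the result columns.

```cypher
// Count grants for each (source, funder) pair; names are assumed.
MATCH (g:grant)
RETURN g.source AS source, g.funder AS funder, count(g) AS count
ORDER BY count DESC
LIMIT 10;
```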

| source | funder | count |
| --- | --- | --- |
| “arc.gov.au” | “arc.gov.au” | 32338 |
| “nhmrc.org” | “National Health and Medical Research Council” | 28465 |
| “orcid.org” | “Australian Research Council” | 7980 |
| “orcid.org” | “National Health and Medical Research Council” | 4310 |
| “orcid.org” | “Canadian Institutes of Health Research” | 649 |
| “orcid.org” | “European Commission” | 513 |
| “orcid.org” | “Engineering and Physical Sciences Research Council” | 381 |
| “orcid.org” | “Japan Society for the Promotion of Science” | 367 |
| “orcid.org” | “Marsden Fund” | 254 |
| “orcid.org” | “National Natural Science Foundation of China” | 234 |

Finding the funders giving the highest funding amount
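
Summing the funding amount per source, funder and currency could be written as below. The grant label and the funder, funding_currency and funding_amount properties are assumptions inferred from the result columns, and the sketch assumes funding_amount is stored as a number.

```cypher
// Total funding per (source, funder, currency); names are assumed.
MATCH (g:grant)
WHERE g.funding_amount IS NOT NULL
RETURN g.source AS source,
       g.funder AS funder,
       g.funding_currency AS funding_currency,
       sum(g.funding_amount) AS funding_amount
ORDER BY funding_amount DESC
LIMIT 10;
```

Rows with a null currency cannot be compared directly with rows in a known currency, so totals are best read within one currency group at a time.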

| source | funder | funding_currency | funding_amount |
| --- | --- | --- | --- |
| “arc.gov.au” | “arc.gov.au” | null | 15987295944 |
| “nhmrc.org” | “National Health and Medical Research Council” | null | 15663446606 |
| “nhmrc.org” | “Australian Research Council” | null | 25980289 |
| “crossref.org” | “H2020 European Research Council” | “AUD” | 17801261.919999998 |
| “crossref.org” | “Wellcome Trust” | “AUD” | 13433207.1 |
| “crossref.org” | “Australian Research Data Commons” | “AUD” | 7002139 |
| “crossref.org” | “HORIZON EUROPE European Innovation Council” | “AUD” | 5817155.4399999995 |
| “crossref.org” | “Fundação para a Ciência e a Tecnologia” | “AUD” | 2926727.5999999996 |
| “crossref.org” | “HORIZON EUROPE Marie Sklodowska-Curie Actions” | “AUD” | 2408490.88 |
| “crossref.org” | “American Heart Association” | null | 1285240 |

Researcher Exploration

Finding the researchers who received multiple grants
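
A sketch that counts the grants linked to each researcher and keeps those with more than one is given below. The researcher and grant labels, the relatedTo relationship type and the full_name property are assumptions. This version groups by researcher node; grouping by name instead would pool different people who share a name or have no name recorded.

```cypher
// Researchers linked to more than one grant (names are assumed).
MATCH (r:researcher)-[:relatedTo]-(g:grant)
WITH r, count(DISTINCT g) AS grant_count
WHERE grant_count > 1
RETURN r.source AS source, r.full_name AS full_name, grant_count AS count
ORDER BY count DESC
LIMIT 10;
```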

| source | full_name | count |
| --- | --- | --- |
| “orcid.org” | null | 145 |
| “orcid.org” | “Peter von Dadelszen” | 92 |
| “orcid.org” | “Paul Scuffham” | 77 |
| “orcid.org” | “David Cooke” | 75 |
| “orcid.org” | “David Burt” | 72 |
| “orcid.org” | “Bruce Tonge” | 62 |
| “orcid.org” | “Meera Agar” | 59 |
| “orcid.org” | “Alexander Davies” | 59 |
| “orcid.org” | “Hui-yao Lan” | 55 |
| “orcid.org” | “Jonathan Carapetis” | 54 |

Finding the countries with the highest number of researchers that have published or are yet to publish
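
Since the count covers researchers whether or not they have publications, a sketch can match researcher nodes directly without requiring a link to a publication. The researcher label and a country property on the researcher node are assumptions; in your instance the country may be stored elsewhere, for example on a related organisation.

```cypher
// Researchers per country, regardless of publications (names are assumed).
MATCH (r:researcher)
WHERE r.country IS NOT NULL
RETURN r.source AS source, r.country AS country, count(r) AS count
ORDER BY count DESC
LIMIT 10;
```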

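| source | country | count |
| --- | --- | --- |
| “orcid.org” | “AU” | 140968 |
| “orcid.org” | “US” | 125958 |
| “orcid.org” | “GB” | 88622 |
| “orcid.org” | “CN” | 46758 |
| “orcid.org” | “DE” | 38624 |
| “orcid.org” | “FR” | 26293 |
| “orcid.org” | “CA” | 24127 |
| “orcid.org” | “ES” | 23971 |
| “orcid.org” | “IT” | 23743 |
| “orcid.org” | “IN” | 22295 |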

Publication Exploration
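
The table in this section lists the most common publication types per source. A query of roughly this shape would produce such a breakdown; the publication label and the publication_type property are assumptions inferred from the result columns.

```cypher
// Count publications per (source, publication type); names are assumed.
MATCH (p:publication)
RETURN p.source AS source, p.publication_type AS publication_type, count(p) AS count
ORDER BY count DESC
LIMIT 10;
```

Each source uses its own type vocabulary (for example “journal-article” and “Journal”), so comparable types appear as separate rows.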

| source | publication_type | count |
| --- | --- | --- |
| “crossref.org” | “journal-article” | 1645479 |
| “scopus.com” | “Journal” | 948196 |
| “pubmed.gov” | “journal” | 559538 |
| “crossref.org” | “proceedings-article” | 91975 |
| “crossref.org” | “book-chapter” | 86193 |
| “scopus.com” | “Conference Proceeding” | 61598 |
| “crossref.org” | “posted-content” | 47917 |
| “orcid.org” | “journal-article” | 36130 |
| “scopus.com” | “Book” | 33057 |
| “scopus.com” | “Book Series” | 20585 |

Conclusion

The Australian National Graph serves as a comprehensive research infrastructure of Persistent Identifiers, providing valuable insights into funding allocation, researcher activity, and publication trends. The use of Cypher queries with Neo4j allows for efficient exploration and analysis of this data. By tracking researchers, grants, and publications, the graph helps identify key research contributors and partnerships. Additionally, the funding analysis helps institutions optimise resource allocation and identify underfunded research areas.

About the Author

Data Scientist at Research Graph Foundation