How Persistent Identifiers Are Creating a Navigable Map of Knowledge

Figure 1: A conceptual illustration of a knowledge graph focused on research connections, generated by the author with DALL·E 3 on Feb 28, 2025.

This blog post is based on and summarises the paper “Connected Research: The Potential of the PID Graph” by Helena Cousijn, Ricarda Braukmann, Martin Fenner, Christine Ferguson, René van Horik, Rachael Lammey, Alice Meadows, and Simon Lambert, published in 2021. The original paper introduced the concept of the PID Graph as a transformative approach to connecting research entities through persistent identifiers.

Introduction

In today’s digital research landscape, connecting the dots between various research outputs has become increasingly challenging. Each day, thousands of new papers are published, datasets are created, and research grants are awarded. Yet despite this increase in information, finding meaningful connections between these elements can feel like searching for specific stars in a vast galaxy without a map.

Persistent Identifiers (PIDs), the digital addresses that never change, play a crucial part in navigating the research landscape. These unique, machine-readable codes, such as DOIs (Digital Object Identifiers) for publications or ORCID IDs for researchers, serve as permanent digital addresses that remain stable even if the location of the entity they identify changes. But simply having these identifiers isn’t enough to reveal the rich web of connections that exist in the research ecosystem.
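To make "machine-readable" a little more concrete, here is a minimal Python sketch that is not taken from the paper. It normalises a DOI into its resolver URL form and checks the last character of an ORCID iD, which is a checksum computed with the ISO 7064 MOD 11-2 algorithm. The function names and the DOI below are made up for illustration, the DOI pattern is a deliberate simplification, and the ORCID iD is the well-known sample iD used in ORCID's documentation.

```python
import re

# Common ways a DOI is written; all are reduced to the bare "10.xxxx/suffix" form.
DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(raw: str) -> str:
    """Return a DOI as a lowercase https://doi.org/ URL (DOIs are case-insensitive)."""
    value = raw.strip()
    for prefix in DOI_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    # Simplified shape check (assumption): real-world DOIs can be more varied.
    if not re.fullmatch(r"10\.\d{4,9}/\S+", value):
        raise ValueError(f"not recognised as a DOI: {raw!r}")
    return "https://doi.org/" + value.lower()


def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Accept the iD with or without the https://orcid.org/ prefix and hyphens."""
    compact = orcid.strip().replace("https://orcid.org/", "").replace("-", "")
    if not re.fullmatch(r"\d{15}[\dX]", compact):
        return False
    return orcid_check_char(compact[:15]) == compact[15]


print(normalise_doi("doi:10.1234/Example.5678"))   # made-up DOI
print(is_valid_orcid("https://orcid.org/0000-0002-1825-0097"))
```

Normalising identifiers to one canonical form before storing them matters because two spellings of the same DOI would otherwise look like two different nodes once identifiers are linked together.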

The PID Graph, introduced in the paper “Connected Research: The Potential of the PID Graph” by Cousijn et al. (2021), represents significant progress in how we think about research infrastructure. Rather than viewing PIDs as isolated identifiers, the PID Graph connects them to create a navigable map of research relationships. This approach allows us to answer questions that were previously difficult to address: Which datasets underpin a particular paper? What outputs resulted from a specific grant? Who are the potential collaborators working on similar research?

This article follows the paper to explore how the PID Graph works, how it differs from other knowledge graph approaches in research, and why it matters for the future of connected scholarly communication. By understanding these connections, we gain not just a tool for navigation, but a new lens through which to view the interconnected nature of research itself.

The Foundation: Understanding Persistent Identifiers

As the paper emphasises, PIDs are far more than simple identifiers – they are critical enablers for FAIR (Findable, Accessible, Interoperable, Reusable) research.

The paper identifies several key ways that PIDs and their metadata support research:

Discoverability and accessibility: PIDs uniquely identify research objects, people, institutions, and funding reliably over time, making them easier to find and verify. For example, a DOI will always point to the same publication, even if the publisher changes its website structure.

Usability and citation: PIDs point directly to specific items or versions, increasing usability and enabling proper citation of diverse research outputs beyond just publications. This means researchers can cite not only papers but also datasets, software, and other digital objects with confidence.

Assessment: PIDs facilitate reliable measurement of impact by providing a consistent reference point for tracking usage. Funding agencies can track how often research they supported is cited or reused.

Interconnection and interoperability: PIDs create an open network of specifically identified entities that supports collaboration across disciplines, institutions, and countries. For instance, connecting researcher ORCID IDs with publication DOIs creates a traceable network of scholarly contributions.

The authors emphasise that PIDs need supporting infrastructure to function properly – they require “a long-term commitment to maintain the service by an organisation that is equipped to make such a commitment.” This highlights that a PID is more than just a string of characters; it’s backed by services, standards, and communities that ensure its effectiveness.

The authors also note that PIDs are assigned to a wide range of scholarly entities, from the well-established (publications, datasets, researchers) to the emerging (grants, organisations, software). This variety of PID types forms the foundation upon which the PID Graph can be built, creating connections across the research ecosystem through these stable reference points.

By focusing on how PIDs serve as the fundamental building blocks for connected research infrastructure, the paper lays the groundwork for understanding how the PID Graph represents the next evolution in making research more discoverable, connected, and reusable.

The Landscape of Connected Research

The challenge of connecting research information has given rise to various knowledge graph approaches. A knowledge graph represents entities and their relationships in a structured way that both humans and machines can understand.

The Research Graph, for instance, takes a comprehensive approach by modelling five primary entity types: researchers, publications, research data, grants and organisations.

Other initiatives like the Open Research Knowledge Graph and OpenAIRE Research Graph take similar entity-centric approaches, though with variations in their models and implementations.

These knowledge graphs typically begin with the entities themselves and use various methods to discover and represent connections. They aim to model the research ecosystem comprehensively, capturing the nuances of different entity types and their complex relationships.

The PID Graph: A Focused Approach

The PID Graph takes a fundamentally different approach. Rather than starting with entity types, it puts PIDs themselves at the centre of the model. As the paper explains, the PID Graph “inverts this relationship and takes PIDs themselves as the basic entities that are linked together; whatever they refer to is left implicit.”

This distinction may seem subtle, but it has profound implications for how the graph is constructed and used. By focusing on identifiers rather than the entities they represent, the PID Graph prioritises reliable, well-documented connections with clear provenance.

Think of it like the difference between a map showing cities and the roads between them (traditional knowledge graphs) versus a map showing highway numbers and their intersections (the PID Graph). Both represent the transportation system, but with different emphases and strengths.
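The idea that identifiers, rather than entities, are the nodes can be sketched in a few lines of Python. This is an illustrative toy and not the implementation described in the paper: the class name, the relation labels, and the DOIs are all invented for the example, and the ORCID iD is the sample iD from ORCID's documentation.

```python
from collections import defaultdict, deque

# Placeholder identifiers, invented for this example.
PAPER = "https://doi.org/10.1234/example-paper"
DATASET = "https://doi.org/10.1234/example-dataset"
GRANT = "https://doi.org/10.1234/example-grant"
AUTHOR = "https://orcid.org/0000-0002-1825-0097"


class PidGraph:
    """Toy graph whose nodes are PID strings; what each PID refers to stays implicit."""

    def __init__(self):
        self._edges = defaultdict(set)  # pid -> {(relation, other_pid)}

    def connect(self, source: str, relation: str, target: str) -> None:
        # Store both directions so the graph can be walked from either end.
        self._edges[source].add((relation, target))
        self._edges[target].add((f"inverse:{relation}", source))

    def neighbours(self, pid: str) -> set:
        return {other for _, other in self._edges.get(pid, set())}

    def within(self, start: str, max_hops: int) -> set:
        """Breadth-first search: PIDs reachable from start in at most max_hops steps."""
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            pid, hops = queue.popleft()
            if hops == max_hops:
                continue
            for other in self.neighbours(pid):
                if other not in seen:
                    seen.add(other)
                    queue.append((other, hops + 1))
        return seen - {start}


graph = PidGraph()
graph.connect(PAPER, "isFundedBy", GRANT)
graph.connect(PAPER, "references", DATASET)
graph.connect(DATASET, "isCreatedBy", AUTHOR)

# "What outputs resulted from a specific grant?" becomes a short walk from the grant's PID.
print(sorted(graph.within(GRANT, 2)))
```

Note that the graph never records whether a node is a paper, a dataset, or a person. That information lives in the metadata behind each PID, which is exactly the inversion the paper describes.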

The PID Graph approach offers several advantages:

However, this approach also requires that PID metadata be sufficiently rich to represent the relationships of interest and that the PIDs themselves be of high quality.

Figure 2 shows the scale of the connections between these entities in August 2021.

Figure 2: Connections between entities in the PID Graph, August 2021. Recreated by the author from “Connected research: The potential of the PID graph” by Cousijn et al. (2021).

Technical Implementation

Implementing the PID Graph requires two main components:

1. Backend Services

These services collect PID connections in a standardised way, capturing not only the PIDs but also the provenance details (who made the connection, when, and how). For example, the Crossref/DataCite Event Data service aggregates connections between DOIs, links to ORCID profiles, funding details, and even mentions on external platforms like Wikipedia.

A concrete example of how this works:

Standards like Scholix (Scholarly Link Exchange) help ensure that these connections are recorded consistently. Scholix provides a common information model for exchanging information about the links between scholarly literature and data.
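A link that carries its own provenance might be recorded roughly as follows. This Python sketch is loosely inspired by the kind of information Scholix describes (a source, a target, a relationship type, a link provider, and a date). The field names are simplified for illustration and are not the official Scholix schema, and every value is a placeholder.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass(frozen=True)
class PidLink:
    """One asserted connection between two PIDs, with who/when/how provenance."""
    source_pid: str
    relation: str
    target_pid: str
    provider: str      # who asserted the link
    recorded_on: date  # when it was asserted
    method: str        # how it was found, e.g. "metadata-deposit"


def deduplicate(links):
    """Keep the earliest record for each (source, relation, target) triple."""
    earliest = {}
    for link in links:
        key = (link.source_pid, link.relation, link.target_pid)
        if key not in earliest or link.recorded_on < earliest[key].recorded_on:
            earliest[key] = link
    return list(earliest.values())


# Two hypothetical providers report the same paper-to-dataset link.
links = [
    PidLink("https://doi.org/10.1234/example-paper", "references",
            "https://doi.org/10.1234/example-dataset",
            "example-publisher", date(2021, 3, 1), "reference-list"),
    PidLink("https://doi.org/10.1234/example-paper", "references",
            "https://doi.org/10.1234/example-dataset",
            "example-repository", date(2021, 1, 15), "metadata-deposit"),
]

unique = deduplicate(links)
print(json.dumps(asdict(unique[0]), default=str, indent=2))
```

Keeping the provider, date, and method alongside each link is what lets a consumer of the graph decide how much to trust a given connection, or trace it back to its origin.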

2. Query Interfaces

The PID Graph uses GraphQL, an open-source query language for APIs. Unlike traditional API approaches that might require multiple calls to different endpoints, GraphQL allows clients to request exactly the data they need in a structured format.

This provides a standardised interface that can be federated across systems, making it easier to build applications that leverage the PID Graph. Researchers don’t need to make multiple API calls to different services – a single GraphQL query can pull together information across multiple PID types and their relationships.
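To illustrate the single-request idea, the Python sketch below assembles a GraphQL request body that asks for a funder together with counts of its linked publications and datasets. The endpoint URL, the field names (funder, name, publications, datasets, totalCount), and the funder identifier are assumptions made for illustration; check the schema of whichever GraphQL service you actually use. The code only builds the request and does not send it.

```python
import json

# Hypothetical endpoint; replace with the GraphQL service you actually use.
GRAPHQL_ENDPOINT = "https://api.example.org/graphql"

# Field names here are assumptions, not a documented schema.
QUERY = """
query FunderOutputs($id: ID!) {
  funder(id: $id) {
    name
    publications { totalCount }
    datasets { totalCount }
  }
}
"""


def build_request(funder_pid: str) -> dict:
    """Package the query and its variables as a standard GraphQL-over-HTTP POST."""
    body = {"query": QUERY, "variables": {"id": funder_pid}}
    return {
        "url": GRAPHQL_ENDPOINT,
        "headers": {"Content-Type": "application/json"},
        "data": json.dumps(body).encode("utf-8"),
    }


request = build_request("https://doi.org/10.1234/example-funder")  # placeholder PID
print(request["url"])
print(request["data"].decode("utf-8"))

# To send it with the standard library, one could pass these parts to
# urllib.request.Request(url, data=..., headers=...) and call urlopen on the result.
```

One nested query replaces what would otherwise be separate calls for the funder record, its publications, and its datasets, which is the practical benefit described above.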

Real-World Applications

The true value of the PID Graph is revealed in its practical applications:

Getting Started with PIDs

If you’re looking to engage with the PID ecosystem, here are some practical steps:

For individual researchers:

  1. Register for an ORCID ID: Visit ORCID.org and create a free account. This unique identifier distinguishes you from other researchers, even those with similar names.
  2. Maintain your ORCID profile: Regularly update your profile with new publications, affiliations, and other professional activities.
  3. Use your ORCID ID consistently: Include it when submitting papers, applying for grants, and in other professional contexts.
  4. Ensure your data gets DOIs: When publishing datasets, deposit them in repositories that assign DOIs (such as Zenodo, Figshare, or domain-specific repositories).
  5. Link your identifiers: Whenever possible, connect your ORCID to your publications and datasets by including it during the submission process.

For data managers:

  1. Implement DOIs for datasets: Partner with services like DataCite to assign DOIs to your institution’s research data.
  2. Include rich metadata: Ensure dataset records include comprehensive metadata, particularly relationships to other PIDs (publications, researchers, grants).
  3. Follow metadata standards: Adhere to community standards for metadata to ensure interoperability.
  4. Establish clear workflows: Create processes for consistently capturing and recording PID relationships.
  5. Promote PIDs within your organisation: Educate researchers about the importance of PIDs and how to use them properly. Join the community discussion about PIDs.
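The "include rich metadata" step for data managers can be sketched as follows. This Python snippet builds related-identifier entries in the style of the DataCite metadata schema, which uses the properties relatedIdentifier, relatedIdentifierType, and relationType. The helper name, the small subset of relation types, the shape of the surrounding record, and the DOIs are illustrative; the full controlled vocabulary is defined in the DataCite schema documentation.

```python
# A small subset of DataCite relationType values, chosen for this example.
ALLOWED_RELATIONS = {"IsSupplementTo", "IsCitedBy", "References", "IsNewVersionOf"}


def related_identifier(pid: str, pid_type: str, relation: str) -> dict:
    """Build one related-identifier entry, rejecting unknown relation types early."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unsupported relationType in this sketch: {relation}")
    return {
        "relatedIdentifier": pid,
        "relatedIdentifierType": pid_type,
        "relationType": relation,
    }


# Placeholder dataset record; the DOIs are made up.
dataset_metadata = {
    "doi": "10.1234/example-dataset",
    "relatedIdentifiers": [
        related_identifier("10.1234/example-paper", "DOI", "IsSupplementTo"),
        related_identifier("10.1234/example-dataset-v1", "DOI", "IsNewVersionOf"),
    ],
}

for entry in dataset_metadata["relatedIdentifiers"]:
    print(entry["relationType"], "->", entry["relatedIdentifier"])
```

Each entry of this kind is one edge that the PID Graph can pick up, so validating relation types at deposit time is a cheap way to keep those edges consistent.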

For institutions:

  1. Adopt a Research Organisation Registry (ROR) ID: Register your institution at ROR.org to obtain an official organisational identifier.
  2. Encourage consistent use: Promote the use of your institution’s ROR ID in affiliations across publications, grants, and other research outputs.
  3. Integrate PIDs into institutional systems: Incorporate ORCID IDs, DOIs, and other PIDs into your institution’s research information systems.
  4. Establish PID policies: Create clear policies encouraging or requiring the use of PIDs for research outputs.
  5. Provide training: Offer workshops and resources to help researchers understand and use PIDs effectively.

For publishers:

  1. Ensure all publications have DOIs: Implement DOIs for all article types and other scholarly outputs.
  2. Collect related identifiers: During submission, request information about related datasets, software, and other research outputs with their PIDs.
  3. Require ORCID IDs: Ask authors to provide their ORCID IDs during submission.
  4. Share connection data: Make the relationships between different PIDs openly available through services like Crossref Event Data.
  5. Enhance metadata display: Show related resources (datasets, code, etc.) prominently in article metadata and user interfaces.

Challenges and Future Directions

While the PID Graph offers exciting possibilities, it also faces challenges such as inconsistent adoption of PIDs, gaps in metadata quality, and the need for sustainable infrastructure. The paper concludes with three key recommendations for the research community:

  1. Use PIDs for all entities in the research process. This seems straightforward, but many research entities still lack PIDs. For example, publications and, increasingly, datasets have DOIs, whereas software, experiments, and protocols often lack persistent identifiers. Various stakeholders, from researchers registering ORCID IDs to institutions using ROR IDs, must adopt PIDs consistently.
  2. Track and record connections between PIDs. Simply having PIDs isn’t enough; the relationships between them must be recorded in metadata. This requires infrastructure providers to offer relevant metadata schemas and researchers and institutions to include this relational information.
  3. Make connections openly available. For the PID Graph to reach its full potential, the connection information must be openly shared through infrastructure providers like DataCite and Crossref.

Looking ahead, future development could focus on:

Researchers are also exploring combining verification with retrieval-augmented generation to balance factual accuracy and broad knowledge access. Additionally, integrating trustworthy data sources like knowledge graphs or expert-curated databases could further enhance reliability.

Conclusion: The Connected Future of Research

The PID Graph marks a transformative step in research infrastructure. By focusing on persistent identifiers rather than traditional entities, it creates a reliable, scalable map of research relationships that enhances discovery, attribution, and analysis. As the research landscape becomes more digital and interconnected, embracing the PID Graph—and the principles of FAIR research it supports—will be crucial. Not only does it offer a clear path toward better organisation and accessibility, but it also opens up exciting new avenues for collaboration and innovation across disciplines.


References

Cousijn, H., Braukmann, R., Fenner, M., Ferguson, C., van Horik, R., Lammey, R., … & Lambert, S. (2021). Connected research: The potential of the PID graph. Patterns, 2(1). https://pmc.ncbi.nlm.nih.gov/articles/PMC7815961/

About the Author

Intern at Research Graph Foundation