V0: Building KG (Buehler, 2024)
Last updated
The baseline methodology for knowledge graph construction closely follows the approach introduced by Buehler (2024). The knowledge graph generation process proceeds in four main phases:
Data Acquisition: Collecting research papers from various sources
Data Distillation: Converting research papers into structured text chunks
Local Knowledge Graph Generation: Extracting triples from each text chunk
Global Knowledge Graph Integration: Combining local graphs and refining the result
The process begins with a curated dataset of scientific papers on longevity and aging. This dataset was seeded using HALL (Li et al., 2024), a collection of 136 publications from the past two decades. To expand the dataset, we first extract citations from these papers, retrieving earlier research referenced by HALL papers. Next, we identify papers that have cited works from the HALL dataset, capturing newer research that builds upon these publications. Together, these steps construct a one-degree citation network, linking foundational studies with their academic impact over time.
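The one-degree expansion described above can be sketched as follows. The lookup functions `get_references` and `get_citers` are hypothetical stand-ins for a real citation index; here they read from a toy dictionary.

```python
# Toy citation index: paper_id -> list of paper_ids it cites.
CITATIONS = {
    "hall_paper": ["older_study"],
    "newer_study": ["hall_paper"],
}

def get_references(paper_id):
    """Papers cited by `paper_id` (earlier research)."""
    return set(CITATIONS.get(paper_id, []))

def get_citers(paper_id):
    """Papers that cite `paper_id` (newer research)."""
    return {p for p, refs in CITATIONS.items() if paper_id in refs}

def expand_one_degree(seed_ids):
    """The seed set plus everything it cites and everything citing it."""
    expanded = set(seed_ids)
    for pid in seed_ids:
        expanded |= get_references(pid)  # earlier work referenced by the seeds
        expanded |= get_citers(pid)      # later work building on the seeds
    return expanded

corpus = expand_one_degree({"hall_paper"})
```

In the real pipeline the seed set is the 136 HALL publications and the lookups would be backed by a citation service rather than a static dictionary.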
To ensure efficient processing, the source documents are split into manageable text chunks. We first divide each scientific publication into chunks of 800-1000 tokens, with a 100-token overlap to maintain context.
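A minimal sketch of this chunking step, assuming the text has already been tokenized (the real pipeline would use a proper tokenizer; the fixed size and overlap here mirror the numbers above):

```python
def chunk_tokens(tokens, chunk_size=1000, overlap=100):
    """Split a token list into overlapping fixed-size chunks.

    Each chunk shares its first `overlap` tokens with the tail of the
    previous chunk, so context is not lost at chunk boundaries.
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Placeholder tokens stand in for real tokenizer output.
tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_tokens(tokens)  # 3 chunks: 1000, 1000, 700 tokens
```

Consecutive chunks overlap by exactly 100 tokens, so a sentence cut at a boundary still appears whole in one of the two chunks.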
Once text chunks are created, an LLM generates distilled insights for each chunk:
Summary: A concise overview of key information from the chunk.
Bulleted Insights: A detailed breakdown of significant findings.
This distilled content serves as the "raw context" for subsequent graph generation.
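The distillation step amounts to one LLM call per chunk with a prompt requesting both outputs. A sketch of the prompt assembly, with illustrative wording (the actual prompts used by the system may differ):

```python
def build_distillation_prompt(chunk_text):
    """Assemble the distillation prompt for one text chunk.

    The instruction wording here is a hypothetical example; only the two
    requested outputs (summary and bulleted insights) come from the method.
    """
    return (
        "You are analyzing an excerpt from a scientific paper on aging.\n"
        "1. Summary: give a concise overview of the key information.\n"
        "2. Bulleted insights: list the significant findings as bullet points.\n\n"
        f"Excerpt:\n{chunk_text}"
    )

prompt = build_distillation_prompt("Caloric restriction extended lifespan in mice.")
# The LLM's response to `prompt` becomes the chunk's "raw context".
```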
For each chunk's raw context, the system extracts knowledge in the form of triplets:
Triplet Extraction: The system prompts the LLM to identify key concepts and relationships, outputting them as triplets of the form (subject, relation, object).
Refinement: The initial triplets are refined to ensure consistent labeling and terminology via iterative prompting.
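Turning the LLM's output into usable triplets requires a parsing step. A sketch assuming one pipe-delimited triplet per line, which is a common output convention (the real system's format may differ, e.g. JSON):

```python
def parse_triplets(llm_output):
    """Parse LLM output into (subject, relation, object) tuples.

    Lines that do not contain exactly three non-empty pipe-separated
    fields are discarded as malformed.
    """
    triplets = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            triplets.append(tuple(parts))
    return triplets

raw = """caloric restriction | extends | lifespan
mTOR inhibition | mimics | caloric restriction
malformed line without pipes"""
triples = parse_triplets(raw)  # malformed line is dropped
```

The refinement pass would then re-prompt the LLM with these triplets to normalize labels (e.g. merging "caloric restriction" and "calorie restriction" into one term).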
The local graphs produced from individual chunks are then merged into a global knowledge graph:
Node Embedding: Sentence transformers (e.g., "all-MiniLM-L6-v2") generate vector embeddings for each node
Similarity Detection: Similar nodes are identified using cosine similarity with a configurable threshold (default 0.85)
Node Merging: Similar nodes are merged, with the highest-degree node preserved and connections redirected. The merging process preserves all edge relationships.
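The three integration steps above can be sketched end to end. Toy 2-D vectors stand in for the sentence-transformer embeddings (the real pipeline uses all-MiniLM-L6-v2), and the graph is a plain edge list rather than a graph library object:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def merge_similar_nodes(edges, embeddings, threshold=0.85):
    """Merge nodes whose embedding similarity exceeds `threshold`.

    For each similar pair, the higher-degree node survives and the
    other's edges are redirected to it, preserving every relationship.
    """
    degree = {}
    for s, _, o in edges:
        degree[s] = degree.get(s, 0) + 1
        degree[o] = degree.get(o, 0) + 1

    alias = {}  # dropped node -> surviving node
    nodes = sorted(degree)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                keep, drop = (a, b) if degree[a] >= degree[b] else (b, a)
                alias[drop] = keep

    def resolve(n):
        while n in alias:
            n = alias[n]
        return n

    return [(resolve(s), r, resolve(o)) for s, r, o in edges]

edges = [
    ("caloric restriction", "extends", "lifespan"),
    ("calorie restriction", "activates", "autophagy"),
    ("lifespan", "measured_in", "mice"),
]
embeddings = {
    "caloric restriction": [1.0, 0.0],
    "calorie restriction": [0.99, 0.14],  # near-duplicate wording
    "lifespan": [0.0, 1.0],
    "autophagy": [-1.0, 0.0],
    "mice": [0.0, -1.0],
}
merged = merge_similar_nodes(edges, embeddings)
```

After merging, "calorie restriction" is folded into "caloric restriction" and its `activates` edge is redirected, so the edge count is unchanged.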
Buehler, M. J. (2024). Accelerating scientific discovery with generative knowledge extraction, graph-based representation, and multimodal intelligent graph reasoning. Machine Learning: Science and Technology, 5(3), 035083. https://doi.org/10.1088/2632-2153/ad7228
Li, H., Wu, S., Li, J., Xiong, Z., Yang, K., Ye, W., Ren, J., Wang, Q., Xiong, M., Zheng, Z., Zhang, S., Han, Z., Yang, P., Jiang, B., Ping, J., Zuo, Y., Lu, X., Zhai, Q., Yan, H., Wang, S., Ma, S., Zhang, B., Ye, J., Qu, J., Yang, Y.-G., Zhang, F., Liu, G.-H., Bao, Y., & Zhang, W. (2024). HALL: a comprehensive database for human aging and longevity studies. Nucleic Acids Research, 52(D1), D909–D918. https://doi.org/10.1093/nar/gkad880