An in-memory semantic knowledge graph that turns personal documents into topic-aware search, visualizations, and reproducible experiment logs.

Quick Facts

  • Context: CS584 NLP Knowledge Graph Project
  • Tech Stack: Python 3.13+, sentence-transformers (mpnet), KeyBERT, TextRank, NumPy, uv
  • Links: GitHub Repo | Project Report (PDF)

Overview and Problem

The project builds a lightweight knowledge graph from local documents to improve semantic search and topic discovery. It compares tag extraction strategies (LLM, KeyBERT, TextRank) and similarity thresholds to measure how each affects semantic coherence and graph connectivity.

What I Built

  • Engineered an automated pipeline to extract semantic tags using LLM-backed generation with statistical fallbacks (KeyBERT, TextRank).
  • Embedded the extracted tags with sentence-transformers (mpnet models) to map them into a shared semantic space.
  • Designed a cosine matching algorithm with configurable k and threshold parameters to link documents to topic centroids.
  • Developed interactive HTML graph exports and structured text dumps for visualization and debugging.
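
The cosine matching step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the names `cosine` and `match_to_centroids`, the dict-of-centroids layout, and the default values are all assumptions, and plain Python is used for readability where the listed NumPy would normally vectorize the work.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_to_centroids(doc_vec, centroids, k=3, threshold=0.5):
    """Link one document to at most k topic centroids.

    centroids: hypothetical mapping of topic name -> centroid vector.
    Only the k most similar topics are considered, and of those only
    the ones scoring at or above the threshold are kept.
    """
    scored = [(cosine(doc_vec, vec), name) for name, vec in centroids.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(name, score) for score, name in scored[:k] if score >= threshold]

# Toy 2-D vectors standing in for real mpnet embeddings.
centroids = {"ml": [1.0, 0.0], "cooking": [0.0, 1.0], "travel": [-1.0, 0.0]}
links = match_to_centroids([1.0, 0.2], centroids, k=2, threshold=0.5)
print(links)
```

In this sketch a lower threshold or a larger k admits more links per document, which is the knob behind the connectivity and coherence figures reported in the results.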

Key Results and Impact

  • Achieved the best semantic coherence, 0.97, with the LLM extractor (k=3, threshold=0.7).
  • Reached the strongest connectivity, a largest connected component (LCC) ratio of 1.00, with TextRank (k=5 or k=7, threshold=0.3).
  • Observed topic counts ranging from 32 to 475 across thresholds, a clear granularity trade-off.
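
For readers unfamiliar with the connectivity metric, the sketch below computes an LCC ratio under one common reading: the share of graph nodes that sit in the largest connected component. That definition, the function name `lcc_ratio`, and the node/edge-list input format are assumptions; the note does not say how the project computes the figure.

```python
from collections import defaultdict, deque

def lcc_ratio(nodes, edges):
    """Fraction of nodes in the largest connected component of an undirected graph.

    Assumes every edge endpoint also appears in `nodes`.
    Returns 0.0 for an empty graph.
    """
    nodes = set(nodes)
    if not nodes:
        return 0.0

    # Build an undirected adjacency map.
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    largest = 0
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search over one component, counting its nodes.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        largest = max(largest, size)

    return largest / len(nodes)

# Four nodes, three of them chained together, one isolated.
print(lcc_ratio(["a", "b", "c", "d"], [("a", "b"), ("b", "c")]))  # 0.75
```

Under this reading, a ratio of 1.00 means every node is reachable from every other node, while a ratio near 0.5 means roughly half the graph sits outside the main component.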

Core Learnings

  • Midpoint settings (k=3, threshold=0.5) yielded coherence of about 0.77–0.81 with an LCC ratio of 0.50–0.57 across extractors, showing the trade-off between precision and connectivity.

Related: Projects MOC