A production-grade Python research system implementing a Two-Stage Semantic Cache (“Dragnet & Sniper”) for autonomous LLM workflows, achieving a 96.7% cost reduction on redundant generative workloads.
Quick Facts
- Context: CS 800 Special Problems in CS (Spring 2026, In Progress)
- Tech Stack: Python, FAISS, Anthropic Claude, Hugging Face (Qwen3 Embeddings/Reranker)
- Links: None available
Overview and Problem
High-volume, multi-step LLM agents are often economically unviable because they repeat the same generative work and pay the API cost each time. This project addresses the problem with a hybrid local/cloud semantic caching architecture that stores verified answers and reuses them safely, without hallucination drift.
What I Built
- Engineered a “Dragnet & Sniper” two-stage semantic cache using FAISS and cross-encoder relevance gates to prevent catastrophic cache collisions.
- Implemented an autonomous workflow with SemanticCacheController that handles document retrieval, synthesis, grounding, and persistent storage.
- Designed a robust provenance system that grounds every cached entry against source chunks and verifies it with a secondary LLM.
- Automated knowledge extraction to decompose synthesized answers into queryable (subject, relation, object) triples for an emergent knowledge graph.
- Built a dynamic routing system to dispatch tasks efficiently between executor-class (Claude 3.5 Sonnet) and evaluator-class (Claude 3.5 Haiku) models.
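The two-stage lookup in the first bullet can be illustrated with a minimal sketch. This is not the project's implementation: `embed` is a toy bag-of-words stand-in for the embedding model and FAISS index, `rerank_score` is a toy token-overlap stand-in for the cross-encoder relevance gate, and the class name `TwoStageCache`, the `top_k` value, and the `gate` threshold are all hypothetical. The point is only the shape of the flow: a broad similarity search proposes candidates, then a stricter gate decides whether any candidate is safe to reuse.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query: str, cached_query: str) -> float:
    """Toy Jaccard overlap; stand-in for a cross-encoder relevance score."""
    q, c = set(query.lower().split()), set(cached_query.lower().split())
    return len(q & c) / len(q | c) if (q | c) else 0.0


@dataclass
class CacheEntry:
    query: str
    answer: str


class TwoStageCache:
    """Hypothetical two-stage cache: coarse recall, then a strict gate."""

    def __init__(self, top_k: int = 3, gate: float = 0.8):
        self.entries: List[CacheEntry] = []
        self.top_k = top_k  # assumed candidate count for stage 1
        self.gate = gate    # assumed acceptance threshold for stage 2

    def put(self, query: str, answer: str) -> None:
        self.entries.append(CacheEntry(query, answer))

    def get(self, query: str) -> Optional[str]:
        # Stage 1 ("Dragnet"): cheap vector similarity picks top-k candidates.
        qv = embed(query)
        candidates = sorted(
            self.entries,
            key=lambda e: cosine(qv, embed(e.query)),
            reverse=True,
        )[: self.top_k]
        # Stage 2 ("Sniper"): a stricter scorer must clear the gate,
        # otherwise the lookup is treated as a miss.
        best = max(candidates, key=lambda e: rerank_score(query, e.query), default=None)
        if best is not None and rerank_score(query, best.query) >= self.gate:
            return best.answer
        return None


cache = TwoStageCache()
cache.put("what is the capital of france", "Paris")
hit = cache.get("what is the capital of france")    # identical query clears the gate
miss = cache.get("what is the capital of germany")  # similar wording, rejected by the gate
```

The second lookup is the kind of near-collision the gate exists to stop: the two queries look alike to a coarse similarity search, but returning the cached answer would be wrong, so the sketch treats it as a miss.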
Key Results and Impact
- Achieved a 96.7% cost reduction on redundant generative workloads without sacrificing accuracy.
- Established a rigorous empirical benchmark repository targeting long-context suites like RULER v2, NoLiMa, LongBench v2, and LegalBench.
- Reduced latency on subsequent cache hits by serving them from local, CPU-bound model infrastructure.
Core Learnings
- Demonstrated that semantic caching paired with strict provenance checks and corpus-scoped isolation allows safe, reproducible answer reuse in complex workflows.
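As a rough illustration of corpus-scoped isolation combined with a provenance check, the sketch below keys every entry by corpus and refuses to store an answer that has no source chunks or fails verification. All names here (`ScopedCache`, `GroundedAnswer`, the `verify` callback) are hypothetical, the `verify` callable merely stands in for the secondary-LLM check, and the lookup uses exact normalized-query matching rather than semantic matching to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class GroundedAnswer:
    """An answer plus the IDs of the source chunks that support it."""
    answer: str
    source_chunk_ids: Tuple[str, ...]


class ScopedCache:
    """Hypothetical cache whose entries never cross corpus boundaries."""

    def __init__(self, verify: Callable[[str, Tuple[str, ...]], bool]):
        # `verify` stands in for a secondary verification step (assumption).
        self._verify = verify
        self._store: Dict[Tuple[str, str], GroundedAnswer] = {}

    @staticmethod
    def _key(corpus_id: str, query: str) -> Tuple[str, str]:
        # The corpus ID is part of the key, so identical queries against
        # different corpora can never share an entry.
        return (corpus_id, " ".join(query.lower().split()))

    def put(self, corpus_id: str, query: str, entry: GroundedAnswer) -> bool:
        # Reject entries with no provenance or that fail verification.
        if not entry.source_chunk_ids:
            return False
        if not self._verify(entry.answer, entry.source_chunk_ids):
            return False
        self._store[self._key(corpus_id, query)] = entry
        return True

    def get(self, corpus_id: str, query: str) -> Optional[GroundedAnswer]:
        return self._store.get(self._key(corpus_id, query))


# Usage with a placeholder verifier that accepts everything.
scoped = ScopedCache(verify=lambda answer, chunk_ids: True)
stored = scoped.put("legal", "What is X?", GroundedAnswer("Y", ("chunk-1",)))
same_corpus = scoped.get("legal", "what is   x?")
other_corpus = scoped.get("medical", "What is X?")
ungrounded = scoped.put("legal", "What is Z?", GroundedAnswer("W", ()))
```

Putting the corpus ID in the key makes isolation structural rather than a filter applied after retrieval, which is one simple way to keep reuse reproducible.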
Related: Projects MOC