Robert Caulk runs Emergent Methods, a research lab building news knowledge graphs. With a Ph.D. in computational mechanics, he spent 12 years creating open-source tools for machine learning and data analysis. His work on projects like Flowdapt (model serving) and FreqAI (adaptive modeling) has earned over 1,000 academic citations.
His team built AskNews, which he calls "the largest news knowledge graph in production." It's a system that doesn't just collect news - it understands how events, people, and places connect.
Current AI systems struggle to connect information across sources and domains. Simple vector search misses crucial relationships. But building knowledge graphs at scale brings major technical hurdles around entity extraction, relationship mapping, and query performance.
Emergent Methods built a hybrid system combining vector search and knowledge graphs:
- Vector DB (Qdrant) handles initial broad retrieval (see the retrieval sketch after this list)
- Custom knowledge graph processes relationships
- Translation pipeline normalizes multi-language content
- Entity extraction model identifies key elements
- Context engineering pipeline structures data for LLMs
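A minimal sketch of that two-stage flow, assuming a Qdrant collection whose payloads already carry extracted entity names (the collection name and payload field here are hypothetical illustrations, not AskNews internals):

```python
import networkx as nx
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def hybrid_retrieve(query_vector: list[float], limit: int = 50) -> nx.Graph:
    # Stage 1: broad recall via vector search in Qdrant.
    hits = client.search(
        collection_name="news_articles",  # hypothetical collection name
        query_vector=query_vector,
        limit=limit,
    )
    # Stage 2: selective graph construction from the hits only,
    # rather than maintaining one giant global graph.
    graph = nx.Graph()
    for hit in hits:
        graph.add_node(hit.id, kind="article", score=hit.score)
        for entity in (hit.payload or {}).get("entities", []):
            graph.add_node(entity, kind="entity")
            graph.add_edge(hit.id, entity, relation="mentions")
    return graph
```

The point of the second stage is cost control: the graph is built on demand from a few dozen relevant documents instead of being kept current for the whole corpus.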
Implementation Details:
Data Pipeline:
- All content normalized to English for consistent embeddings
- Entity names preserved in original language when untranslatable
- Custom GLiNER news model handles entity extraction (sketched after this list)
- Retrained every 6 months on fresh data
- Human review validates entity accuracy
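A rough sketch of what that extraction step can look like with the open-source gliner package. The checkpoint id below is illustrative; the episode only confirms a custom news-tuned GLiNER model, retrained periodically:

```python
from gliner import GLiNER

# Checkpoint id is illustrative; swap in whichever news-tuned
# GLiNER model your pipeline was most recently retrained on.
model = GLiNER.from_pretrained("EmergentMethods/gliner_medium_news-v2.1")

text = (
    "The European Central Bank raised rates on Thursday, "
    "President Christine Lagarde told reporters in Frankfurt."
)
labels = ["person", "organization", "location", "event"]

# GLiNER is zero-shot over whatever label set you pass in, so new
# entity types can be added without retraining the base model.
for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f'{ent["text"]} -> {ent["label"]} ({ent["score"]:.2f})')
```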
Entity Management:
- Base extraction uses the BERT-style GLiNER architecture
- Trained on diverse data across topics/regions
- Disambiguation system merges duplicate entities (a minimal merge sketch follows this list)
- Manual override options for analysts
- Metadata tracking preserves relationship context
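The episode doesn't detail the disambiguation algorithm. As one hedged illustration, a string-normalization pass with an analyst override table could look like the following; a production system would layer on embedding similarity and context:

```python
import re
from collections import Counter, defaultdict

# Analyst-supplied overrides take precedence over automatic merging.
MANUAL_OVERRIDES = {"ecb": "European Central Bank"}  # hypothetical entries

def canonical_key(name: str) -> str:
    # Crude normalization: drop punctuation, collapse case and whitespace.
    return re.sub(r"[^\w\s]", "", name).casefold().strip()

def merge_mentions(mentions: list[dict]) -> dict[str, list[dict]]:
    """Group raw mentions under one canonical name, keeping each original
    mention so source, timestamp, and article id (the relationship
    context) survive the merge as metadata."""
    by_key: dict[str, list[dict]] = defaultdict(list)
    for m in mentions:
        by_key[canonical_key(m["text"])].append(m)
    merged: dict[str, list[dict]] = {}
    for key, group in by_key.items():
        surfaces = Counter(m["text"] for m in group)
        canonical = MANUAL_OVERRIDES.get(key, surfaces.most_common(1)[0][0])
        merged[canonical] = group
    return merged
```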
Knowledge Graph:
- Selective graph construction from vector results
- On-demand relationship processing
- Graph queries via standard Cypher (example query after this list)
- Built for specific use cases vs general coverage
- Integration with S3 and other data stores
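For flavor, here is the kind of standard Cypher query such a graph supports, run through the Neo4j Python driver. The connection details and schema are placeholders: the episode confirms Cypher as the query language, not which backend hosts the graph:

```python
from neo4j import GraphDatabase

# Placeholder connection details for a local graph instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Which entities co-occur most often with a given entity?
CO_MENTIONS = """
MATCH (a:Article)-[:MENTIONS]->(e:Entity {name: $name}),
      (a)-[:MENTIONS]->(other:Entity)
WHERE other.name <> $name
RETURN other.name AS co_mention, count(DISTINCT a) AS articles
ORDER BY articles DESC
LIMIT 10
"""

with driver.session() as session:
    for record in session.run(CO_MENTIONS, name="Ukraine"):
        print(record["co_mention"], record["articles"])
driver.close()
```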
System Validation:
- Custom "Context is King" benchmark suite
- RAGAS metrics track retrieval accuracy
- Time-split validation prevents data leakage (see the split sketch below)
- Manual review of entity extraction
- Production monitoring of query patterns
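One way to read "time-split validation": partition the corpus by publication date so retrieval quality (e.g., the RAGAS metrics) is measured only on articles newer than anything the system was tuned on. A minimal sketch, with a hypothetical cutoff and record schema:

```python
from datetime import datetime, timezone

CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical split date

def time_split(articles: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split on publication time so benchmark questions are drawn only
    from articles published after everything used to tune the system,
    preventing leakage from train to eval."""
    train = [a for a in articles if a["published_at"] < CUTOFF]
    holdout = [a for a in articles if a["published_at"] >= CUTOFF]
    return train, holdout
```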
Engineering Insights:
Key Technical Decisions:
- English normalization enables consistent embeddings
- Hybrid vector + graph approach balances speed/depth
- Selective graph construction keeps costs down
- Human-in-loop validation maintains quality
Dead Ends Hit:
- Full multi-language entity system too complex
- Real-time graph updates not feasible at scale
- Pure vector or pure graph approaches insufficient
Top Quotes:
- "At its core, context engineering is about how we feed information to AI. We want clear, focused inputs for better outputs. Think of it like talking to a smart friend - you'd give them the key facts in a way they can use, not dump raw data on them." - Robert
- "Strong metadata paints a high-fidelity picture. If we're trying to understand what's happening in Ukraine, we need to know not just what was said, but who said it, when they said it, and what voice they used to say it. Each piece adds color to the picture." - Robert
- "Clean data beats clever models. You can throw noise at an LLM and get something that looks good, but if you want real accuracy, you need to strip away the clutter first. Every piece of noise pulls the model in a different direction." - Robert
- "Think about how the answer looks in the real world. If you're comparing apartments, you'd want a table. If you're tracking events, you'd want a timeline. Match your data structure to how humans naturally process that kind of information." - Nico
- "Building knowledge graphs isn't about collecting everything - it's about finding the relationships that matter. Most applications don't need a massive graph. They need the right connections for their specific problem." - Robert
- "The quality of your context sets the ceiling for what your AI can do. You can have the best model in the world, but if you feed it noisy, unclear data, you'll get noisy, unclear answers. Garbage in, garbage out still applies." - Robert
- "When handling multiple languages, it's better to normalize everything to one language than to try juggling many. Yes, you lose some nuance, but you gain consistency. And consistency is what makes these systems reliable." - Robert
- "The hard part isn't storing the data - it's making it useful. Anyone can build a database. The trick is structuring information so an AI can actually reason with it. That's where context engineering makes the difference." - Robert
- "Start simple, then add complexity only when you need it. Most teams jump straight to sophisticated solutions when they could get better results by just cleaning their data and thinking carefully about how they structure it." - Nico
- "Every token in your context window is precious. Don't waste them on HTML tags or formatting noise. Save that space for the actual signal - the facts, relationships, and context that help the AI understand what you're asking." - Nico
Robert Caulk:
Nicolay Gerold:
00:00 Introduction to Context Engineering
00:24 Curating Input Signals
01:01 Structuring Raw Data
03:05 Refinement and Iteration
04:08 Balancing Breadth and Precision
06:10 Interview Start
08:02 Challenges in Context Engineering
20:25 Optimizing Context for LLMs
45:44 Advanced Cypher Queries and Graphs
46:43 Enrichment Pipeline Flexibility
47:16 Combining Graph and Semantic Search
49:23 Handling Multilingual Entities
52:57 Disambiguation and Deduplication Challenges
55:37 Training Models for Diverse Domains
01:04:43 Dealing with AI-Generated Content
01:17:32 Future Developments and Final Thoughts