· 01:33:35
Nicolay Gerold: Building
context is all anyone is doing.
In traditional ML, we spent a lot
of time constructing features.
We crafted, selected, and transformed raw data into a form that the models can learn from.
And context engineering is feature engineering for LLMs.
And for context engineering, first, we try to curate input signals.
So each piece of context acts like a
feature that informs the model's output.
And we want to remove noise like
HTML tags or unnecessary punctuation.
And we also want to
eliminate distractions.
So we want to trim unnecessary
details to prevent the model
from getting sidetracked.
So basically we optimize for a high signal-to-noise ratio, and we want clear, focused inputs so we can get better outputs.
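To make that cleanup step concrete, here is a minimal sketch of what it could look like; the helper name and the BeautifulSoup-based approach are my own assumptions, not a specific tool mentioned in the episode.

```python
# A minimal sketch of the "remove noise" step: strip HTML tags and
# collapse whitespace so only the signal reaches the LLM.
# The helper name and the use of BeautifulSoup are assumptions.
from bs4 import BeautifulSoup
import re

def clean_context(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop scripts, styles, and navigation chrome that carry no signal.
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Collapse runs of whitespace left behind by the markup.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(clean_context("<div><nav>Home</nav><p>Biden  exits the race.</p></div>"))
# -> "Biden exits the race."
```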
The second part is we try
to give it a structure.
So raw data often is
messy and unstructured.
We normalize it and encode categorical variables.
We create interaction terms.
And for LLMs, we have to transform raw text into structured formats.
So we use bullet points, tables, or timelines to break down complex information and present it in a way the LLM can digest best.
So we are creating structured
and meaningful representations
of the information.
And there are two things I like to think about.
First, how would I present this information to a colleague, to a buddy of mine, so they can make a good decision?
And depending on the type of
information, this is very different.
I think a good example is an Airbnb case I've seen once: how could you represent the different information?
If I want to showcase a single apartment, I do it more like a card where I can see a picture, a title, and some key attributes.
When I want to allow the user to decide between multiple different selected homes, I might make it more like a table where I can quickly scan for selected attributes and compare them across 5 to 10 different apartments.
The goal of the user really determines
how you should present the information.
And also, the task you give the
LLM should really determine how
you present the information.
And you really want to create
this meaningful representation.
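A rough sketch of that idea with the Airbnb example: the same listing data rendered either as a single "card" or as a comparison table, depending on the task you give the LLM. The listing fields, values, and helper names are invented for illustration.

```python
# Hypothetical listings; the fields and values are invented for illustration.
listings = [
    {"title": "Sunny loft", "price": 120, "sqm": 45, "rating": 4.8},
    {"title": "Old town flat", "price": 95, "sqm": 38, "rating": 4.6},
]

def as_card(listing: dict) -> str:
    # Single-item view: one listing, key attributes spelled out.
    return (
        f"## {listing['title']}\n"
        f"- Price per night: ${listing['price']}\n"
        f"- Size: {listing['sqm']} sqm\n"
        f"- Rating: {listing['rating']}/5"
    )

def as_table(listings: list[dict]) -> str:
    # Comparison view: one row per listing so the LLM can scan and compare.
    header = "| Title | Price | sqm | Rating |\n|---|---|---|---|"
    rows = [
        f"| {l['title']} | ${l['price']} | {l['sqm']} | {l['rating']} |"
        for l in listings
    ]
    return "\n".join([header, *rows])

print(as_card(listings[0]))   # context for "tell me about this apartment"
print(as_table(listings))     # context for "compare these apartments"
```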
And another question I like to ask myself is: in which context does the answer appear on the internet?
So basically, I give the LLM a
bunch of information and I ask it
to answer something or to do a task.
And what I like to ask myself is
like, how does the context look on the
internet before the answer is given?
And I try to reconstruct that.
And the third part of context
engineering is refinement.
So basically, feature sets in ML evolve over time through iteration and testing.
Likewise, context engineering is an ongoing process.
You want to refine your context based on performance feedback to improve the accuracy or the task completion rate, whatever you're going for from the LLM.
So you want to test and tweak a lot.
Try out different stuff, different
prompts, different models.
And you want to evaluate how changes
in context affect the model output.
And you want to use these insights
basically to continuously improve the
quality of the engineered context.
And what, in my opinion, is really important is that you want to make it easy and fast to make a change, test it, evaluate it, and then put it into production.
So the CI/CD part in LLMs, I think, is something that's really overlooked at the moment.
And the last part: balance breadth and precision.
So in traditional ML, there was always a trade-off between including too many features, so risking overfitting and noise, and too few, so you likely miss critical information.
With LLMs, the context window is really precious.
Every token counts.
Too much context can sidetrack the model, just as too many features can lead to overfitting in ML.
You need enough information to basically
ground the model, but not so much that it
becomes overwhelmed by irrelevant details.
And the quality of the engineered
features in the end often sets the
ceiling for ML model performance.
And I think the quality of the context
sets the ceiling of LLM model performance.
If your context is very noisy and contains information that you usually don't see in this setting, you are likely getting worse outputs than if you clean it up.
And we want to transform raw data into really concise, high-fidelity context that doesn't contain any distractions, so you can get the reliable and actually actionable insights you want.
Today we are talking to Robert Caulk.
Robert is the CEO of Emergent Methods, a research lab researching context engineering for LLMs, and they have built AskNews, which is the largest knowledge graph in production.
And Rob and I talk a lot about context engineering, but also knowledge graphs and how they can be used in tandem with LLMs and vector search.
And we really want to go into how you
can get the best results from your LLMs.
Let's do it.
For me, the definition of an agent follows the reinforcement learning definition.
I think agent has become a marketing term.
In the technical sphere, agent doesn't carry any value anymore, because if I tell you to implement an agent, you wouldn't know what to do with it.
So I think from a technical perspective, in the technical domain, we should scratch the term.
We shouldn't use it anymore and should come up with new definitions which are way more specific: what are we actually implementing?
From a marketing perspective, when you're interacting with non-technical people, I think it's always good to have such terms because it abstracts away a lot.
They think they know what you're talking about.
Everyone has something in mind.
It won't be what you're actually implementing anyhow, but that's how it always goes when you're implementing stuff.
Robert Caulk: I agree.
But yeah, going back to the RL: I think my definition comes from RL as well, where it's doing a lot of decision making, right?
The point of the agent is that
it's operating in an environment
learning the policy, and that policy
is some functional description of
whatever you're trying to achieve.
But it's complex, it's hyper
dimensional, it's difficult to
map, and that's why you need the
agent self exploring in some ways.
But, yeah, I agree.
The abstraction, especially for
interacting with clients, it's
really important to have these
words that just communicate to
them that you know what's going on.
Because the point is the client is coming to you for expertise.
Buzzwords are an unfortunate reality of interacting with clients.
At least that's my experience.
Nicolay Gerold: Yep.
I think it already brings us
a little bit in the direction
of like context engineering.
What is your experience in having agents, something like CrewAI or the LangChain agents, actually self-construct their context?
Are they able to pull from different data sources you have internally and actually pick the right ones, with the right queries?
Robert Caulk: Yeah, it's a question of what level of abstraction you should target when you're developing agents and you're developing some system for building context.
I think at the end of the day, building context is all anyone is doing if they're feeding LLMs, because you're trying to extract some insight.
You want to leverage the reasoning ability of that LLM to actually synthesize and summarize and make recommendations and leverage training knowledge on top of whatever you're building.
But what you're building is the context.
And that stems from the context engineering, and the ability to build it is hard for sure.
Answering your question directly, my experience is that this is extremely difficult, because if you give it too much freedom, it's really hard to quality control.
And it's really hard especially if you're chaining multiple agents together: this guy fetches something, which makes decisions about what this guy fetches, and then that guy constructs context.
Controlling these things becomes very difficult.
And in a world where you need strict quality controls and your context requires some adherence to accuracy and relevance, they can go off the rails very quickly.
And so for me, usually I try to keep it as simple as possible.
And a lot of times that
means it's really very basic.
This function has one responsibility.
It generates this exact
structure of information.
And then that structure allows me to
then go find other pieces of information.
I have full control over it.
Instead of all of this autonomy, it's a lot more just functions with a little bit of reasoning.
One example, maybe it's better
to operate in the example world.
Generating a question tree is
really good for good retrieval.
You want to really maximize your retrieval from a database.
The first thing you can do is take a user query and then say, hey, what question tree should I build in order to really flesh out the entire response?
And building a question tree is
really a function, but you do need
some LLM to just get that tree.
But it's not really an agent,
because it's not super self directed.
It has one goal, and that
is to build a question tree.
And then you can take the
questions and then search on those.
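A minimal sketch of that pattern, assuming an OpenAI-compatible chat client and a retrieval function you already have; the prompt, model name, and helper names are placeholders, not the exact setup described here.

```python
# Sketch: one LLM call with a single responsibility -- turn a user query
# into a small question tree -- then plain retrieval over those questions.
# The client, model name, and search() are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def build_question_tree(user_query: str) -> list[str]:
    prompt = (
        "Break the following query into 3-5 sub-questions that, answered "
        "together, would fully flesh out the response. "
        "Return one sub-question per line, nothing else.\n\n"
        f"Query: {user_query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]

def retrieve_for_query(user_query: str, search) -> list[dict]:
    # search() is whatever retrieval function you already have
    # (vector DB query, keyword search, ...).
    docs = []
    for question in build_question_tree(user_query):
        docs.extend(search(question, top_k=5))
    return docs
```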
So to me, I try to stick to that level
of simplicity because it makes things
a lot more maintainable, sustainable
just easier for other engineers to
work with in the future when you
need other engineers to come in
and maintain it and think about it.
Maybe there are other companies that are operating on a much higher abstraction level with all of the self-autonomy.
We are focused on
delivering very direct news.
And therefore, it's a lot harder for
us to just let things go off the rails.
We need to be very regimented in how we report information, making sure that information strictly adheres to guidelines.
And that comes from a lot of
very strict function calls.
Nicolay Gerold: Yeah, when I give an LLM SQL access, I really like, for example, to construct views, because you can actually control what it queries and you can give clear schemas, as opposed to feeding it 10 different tables and saying, hey, do whatever joins you want.
You already mentioned like
the context quality part.
What are the different technical
components of context that
you think matter the most
for a high context quality?
Robert Caulk: Yeah,
that's a great question.
In my opinion, it's about strong
metadata and painting a high
fidelity contextual picture.
And obviously that's a bunch
of kind of abstract jargon, but
that comes from good metadata.
So the stronger the metadata that you give, it seems, the stronger the general business insight.
And there's some intuition there.
It's like this: if I want to forecast whether or not Kamala Harris is going to win the presidential election, let's start with what kind of contextual picture we should be building.
Let's say we know Biden exited the race.
That paints a bit of the contextual picture about what kind of forecast we're going to be making for Kamala winning this race.
And if we know when he exited the race, when was this reported?
That is another piece of metadata
that comes in very handy.
If he exited two days before the election, obviously that changes the forecast one hundred percent.
If we know that he exited six months
before, that's a very good piece of
metadata to continue to paint the picture.
And so if you extend the limit
case on that, how much metadata
can we add to news reports?
What about reporting voice?
Knowing the voice that
this report was written in.
Was it sensational, persuasive, objective, analytical, investigative?
That gives you a contextual data point of
how to take this information into account.
Let's say we got, ten persuasive articles
about how Russia is the greatest country
in the world, and it's going to win,
and it's very persuasive about why
Ukraine should be a part of Russia.
That is context, and it is useful.
But knowing that it's sensational,
knowing that it's persuasive, knowing
that it's coming out of Russia
lets the LLM reason on top of that.
And then if you include, you start
painting the rest of the picture.
Okay, how is the rest of the world thinking about this Russian invasion?
Obviously Ukraine is on the opposing
side, and knowing what comes out
of Ukraine is equally important,
and what kind of reporting voice
they're using is equally important.
You can obviously continue out into
all of the countries and continents
and then all of a sudden you've painted
this very high fidelity picture of
what's going on, and you can then leverage the reasoning of that LLM to forecast: what's the highest-probability scenario of Russia backing out of Ukraine right now?
Russia thinks that it owns Ukraine
and it should be a part of it.
That tells me that they're
not going to back out.
Ukraine thinks this.
But then you have other countries that
are making speculations from the outside
about, hey, it looks like there are North Korean troops that are needed.
That's an indication that there's weakness here.
But then there's a demand for oil.
All of these little context data points
and the metadata associated with them
end up giving you a really high-quality, high-probability next-token forecast.
That's at least how we see it when we talk about leveraging LLMs for forecasts.
Nicolay Gerold: Do you ever
think about how to best represent
the context based on the query?
For example, if it's more something
I have to compare and contrast.
So I asked the LLM to compare
different Airbnbs, for example.
So I probably have a location.
I have a description.
I have some metrics like
square footage price.
So I'm probably best off
representing it as a table.
I would assume when I have like
events, like you just talked
about, like the presidential
race, it's probably a timeline.
Do you have a set of different representations where you basically feed in all the different metadata dynamically, basically like a template?
Robert Caulk: That's a good question.
For our forecasting endpoint,
indeed we make a timeline.
So you're already, you're ahead of me.
Because yeah.
Timeline is a really nice, succinct way
to indicate the evolution of information.
We make sure that the timeline
is in chronological order, right?
That's a very small detail, but it plays
a big role, especially when you're talking
about LLMs that are not great with numbers
and you're giving it a bunch of dates.
But if you're giving it dates
and you're saying, hey, this
is the start of the timeline.
This is the end of the timeline.
You're just helping nudge
it in the right direction.
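A small sketch of that formatting: retrieved articles carrying publish dates as metadata are sorted chronologically, and the start and end of the timeline are stated explicitly. The field names and example entries are assumptions.

```python
# Sketch: render retrieved articles as a chronological timeline block,
# stating the start and end dates explicitly so the model doesn't have
# to reason about date ordering itself. Field names are assumptions.
from datetime import date

articles = [
    {"published": date(2024, 8, 22), "summary": "Harris accepts the nomination."},
    {"published": date(2024, 7, 21), "summary": "Biden exits the race."},
]

def as_timeline(articles: list[dict]) -> str:
    ordered = sorted(articles, key=lambda a: a["published"])
    lines = [
        f"Timeline start: {ordered[0]['published']}",
        f"Timeline end: {ordered[-1]['published']}",
    ]
    lines += [f"- [{a['published']}] {a['summary']}" for a in ordered]
    return "\n".join(lines)

print(as_timeline(articles))
```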
Yeah, it depends on the endpoint.
It depends on what type of workflow
we're trying to give to you.
A lot of our fundamental endpoints at AskNews are geared at letting you make a lot of those decisions.
We're just giving you the real basic Legos and making sure they're strong, good Legos produced at the Lego factory, instead of those weak, fake knockoff Legos that you might get from one of our competitors.
Just kidding.
But it's true that the stronger
that base Lego, then you can
go decide how to format it.
All of the winners from the Metaculus forecasting tournament, they're deciding how to use our Legos on their own.
They may be making their own timelines.
They may be, deciding only
to take subsets of the Legos.
When you were mentioning that, it brought out an interesting thought.
If I were trying to be a very generalized task research endpoint, having an agent actually decide the format or configuration, that's nice.
That's like a very dynamic approach
to context engineering, where
you outsource that abstraction to
another LLM to say, hey, here's the
query, here's the data that we've
researched associated with the query.
Now you decide the best type of context configuration and format of all of this information that will get me the best answer.
That's a pretty, pretty interesting idea.
Nicolay Gerold: And how do you actually
measure the context quality you have?
Do you have some internal metrics you're going after, or some KPIs?
Robert Caulk: Yeah, we have this one benchmark.
It's called Context is King.
And that benchmark is geared at exactly this, evaluating the quality of that context.
But it's not that good, if I'm being honest, and I think that the evaluation industry has a lot of maturation ahead of it that it will see within the next decade.
Part of the problem is it's very
difficult to just decide what is
quality context and what isn't.
A lot of the direction of quality for us comes from research and trial and error and building a strong foundation in order to get to good context.
But for sure, there are tons of ways to do it, with Ragas for example.
You have retrieval accuracy, which is one
type, one side of context engineering.
Did you retrieve the right stuff?
That's a start.
And then you have the accuracy, the correctness: is the information inside of that context what you need to get to your final answer?
Did the answer come out correctly?
Essentially was it able to actually
find the proper information?
So those are ways to do it, but it's definitely a flawed, very subjective approach.
We try our best to make sure the benchmark is an even playing field, but it's not easy.
It's really hard, really challenging.
Nicolay Gerold: What do you think
makes context optimized for LLMs?
Robert Caulk: First of all, I think that conciseness is key.
So eliminating a lot of noise.
So trying to find the signal before
you get to that final call, there's
always some final call somewhere,
no matter how many additional calls
you use to build your context.
At the end of the day, you want to get to some final insight, making sure that there's not a ton of noise in that final call.
So that means eliminating basic stuff like tags, even avoiding HTML and raw text altogether, building a concise, synthetic picture of that information, and being very repeatable in an expected way.
It allows you to get really good outputs.
The problem here is
when you're dealing with dirty HTML and
inconsistently formatted information and
trying to get them all together, the LLM
can spit something out that's coherent.
And that is a bit of a pothole, in my opinion, because it looks like it's made some good insight, because,
whoa, I dumped a bunch of dirty data
in and I got something coherent out.
But if you start to look
closely, your hallucination
rate is going to be much higher.
Your accuracy is going to be much
lower, because some of the noise that you fed in is contributing to the probability of the next token.
That is objectively true.
So the cleaner and more concise the input can be, the better off it's going to be in the end, not to mention it's going to be cheaper, right?
Your final call is always your most expensive LLM.
It's your most high-powered LLM.
You're paying a lot per token.
Why are you wasting that very valuable window with HTML tags and ellipses and all of this?
The final point I'll make is at the
end of the day, these LLMs are just
pattern recognition machines, right?
There's a probabilistic generation of a token.
That's amazing.
And it can do a lot of things, but if you can build a pattern before your output, a pattern that's clean and recognizable, you're going to catch on to a lot more of the intuition of that machine than if you just throw in a bunch of noise, where each token gets pulled a little bit in different directions just because of all of this random noise.
The one example I'll give you is that HTML is loaded with descriptions ending with ellipses.
This is just a huge problem for an LLM, because as soon as it sees a sentence that ends with dot dot dot, it just wants to finish that sentence.
Even if it's not doing it outwardly in your output, when it's trying to process all of this information, it's a lot more likely that it's filled in whatever it thought should go there, just because you've seeded it to wonder what goes here.
That's a very dangerous situation
when you're dealing with
geopolitical forecasts, right?
You need to be very clear about what is and what is not.
"Hey, this is the first half of a sentence, good luck" is not a good way to go.
So I think those are some of the methods we find work really well, and that's what we do for people.
That is the service.
We say, hey, this is what is here, this is what is not, instead of: here's a bunch of noise, good luck.
That's the biggest differentiation
between how we engineer context and
how maybe other services are hoping
you'll engineer on top of them.
Nicolay Gerold: Yeah.
One question I always like to ask when I start with that is: what does the document look like that would contain the answer to the question I'm asking?
And I think that's always an interesting thought experiment to go through.
Okay, what type of lingo do they use?
How is it formatted?
And then you can likely already arrive at a direction for how you should construct the context, so
you can elicit the right responses.
Robert Caulk: Yeah.
That's a great point.
And you can even make calls, right?
You could have an agent or
just another LLM call, and its only job is to speculate.
Because what you're doing is you're
speculating, which is getting you
closer to maybe where you want to be.
And that's a little like HyDE. I don't know if you've heard of HyDE, but this is a
Nicolay Gerold: Can you explain it shortly?
Robert Caulk: Sure.
HyDE is a method of exploration into a
database, especially a vector database,
where you take a user query and you
try to match it closer to the structure
of the document that was embedded.
And you maybe even guess at
the answer beforehand to get
closer to where it was embedded.
So if you have a bunch of three paragraph
embeddings for your vectors about
news, let's take it as an example.
And then a user query comes in and it
says, Hey, tell me about Trump and AI.
You could embed those six words or whatever into the same space
as your three paragraphs, but
you're not going to be as close.
But if you take a guess at what the
answer document should look like, and
then embed that, you're going to get
really close to all of the Trump-and-AI documents, even though this hypothetical document is totally fake and completely useless on its own.
It's very good for getting closer to your retrieval documents.
It's essentially a glorified term expansion when you think about it from a search industry perspective.
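A rough sketch of HyDE under the same assumptions as before (an OpenAI-compatible client and an existing vector search function): generate a hypothetical answer document, embed it, and search with that embedding instead of the raw query. The model names and `vector_search()` are placeholders.

```python
# Sketch of HyDE: embed a guessed answer document instead of the short
# query, so the query vector lands closer to the real documents.
# Model names and the vector_search() helper are assumptions.
from openai import OpenAI

client = OpenAI()

def hyde_search(user_query: str, vector_search, top_k: int = 10):
    # 1. Guess what an answering document would look like.
    guess = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short news-style paragraph that would answer: {user_query}",
        }],
    ).choices[0].message.content

    # 2. Embed the hypothetical document, not the six-word query.
    vector = client.embeddings.create(
        model="text-embedding-3-small", input=guess
    ).data[0].embedding

    # 3. Search the vector database with that embedding.
    return vector_search(vector, top_k=top_k)
```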
Nicolay Gerold: Yeah, so for
anyone who is curious and wants
to look it up, HyDE stands for
hypothetical document embedding.
I think what comes to mind for me, and what most people forget with the whole obsession over chunk sizes of 200 to 400 tokens, is: what do I actually want to feed into my LLM?
And where is the answer?
If the answer is likely contained in one longer document, I probably shouldn't chunk it.
I should just embed the whole document, feed it into a long-context LLM, and not rip that stuff apart.
If it's spread across documents, I'm probably good with chunking.
Robert Caulk: Actually, on that point, I think those decisions you're talking about there are really where a lot of money is made in this industry, right?
Because you're talking about how to optimize retrieval for your specific use case.
Another approach that I think is really fun and useful: maybe you do want to chunk, and you want to say, okay, I'm going to chunk down to paragraph size, but I'm going to store, as metadata in Qdrant or whatever database you're using, Pinecone, Weaviate, the abutting three to six paragraphs.
So when I do my retrieval, the embedding is very accurate, because if you want to embed maybe three pages, all of a sudden you're losing a lot of data.
But if you embed just that one paragraph,
that embedding is very representative
of at least that particular paragraph.
So then I retrieve it and then
I just attach the metadata to it
that I've stored with that chunk.
And that's really powerful because
then your context still contains the
auxiliary information, which almost
certainly needs to be present in
order for you to make a good analysis.
But you saved on how much you needed to embed.
The actual embedding inference was cheaper because there were fewer tokens that you needed to embed.
All of this adds up to potentially
a more accurate picture.
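A sketch of that pattern with the Qdrant client: embed only the paragraph, but carry the abutting paragraphs along in the payload so they come back for free at retrieval time. The collection name, `embed()`, and the window size are assumptions, not the actual AskNews setup.

```python
# Sketch: embed each paragraph on its own (accurate vector), but store the
# neighbouring paragraphs as payload so the retrieved context keeps its
# surroundings. Collection name, embed(), and window size are assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")
client.create_collection(
    collection_name="news_paragraphs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

def index_document(doc_id: str, paragraphs: list[str], embed, window: int = 2):
    points = []
    for i, para in enumerate(paragraphs):
        neighbours = paragraphs[max(0, i - window): i + window + 1]
        points.append(PointStruct(
            id=hash((doc_id, i)) % (2**63),
            vector=embed(para),                      # embed only this paragraph
            payload={"doc_id": doc_id, "text": para,
                     "context_window": neighbours},  # but store the neighbours
        ))
    client.upsert(collection_name="news_paragraphs", points=points)

def retrieve(query_vector, top_k: int = 5) -> list[str]:
    hits = client.search(
        collection_name="news_paragraphs", query_vector=query_vector, limit=top_k
    )
    # Feed the LLM the stored window, not just the bare paragraph.
    return ["\n".join(hit.payload["context_window"]) for hit in hits]
```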
It depends obviously on your use case, but there's so much creativity you can put into chunking, like you're talking about, and that in itself is a part of context engineering, because it's: how do I get to the information I need?
And that's really fun.
It's a really cool industry to be a part of.
Nicolay Gerold: Yeah.
And especially with multi-vectors, you can do some really crazy patterns, because I don't really have to embed the long document as one vector.
I can still split it up into four different vectors for one document, store them associated with that one document, and retrieve over all of those.
Where do you actually think when
should people start considering to
add knowledge graphs to their search?
Robert Caulk: I think it depends a lot
on how disparate the relationships are
between your pieces of information.
If you're living in a purely semantic world, it makes no sense to start integrating a knowledge graph.
But if you're living in a world where, you know, this piece of clothing, this dress, is actually very closely related to this hat because they match, and that match is an abstract understanding, that's your ontology that you need to apply, then it makes a lot of sense to build a graph.
Or you might not even need to build a graph; you can use metadata to really start linking things.
That's, I think, the more difficult
decision to make, because I think most
applications do have these disparate
connections that need to be established.
But the question, when you think about integrating a true knowledge database, a graph database, is more: how complex and sophisticated do your queries need to be on these types of relationships?
If they don't need to be that large or that sophisticated, you may be able to get away with just using metadata in a Postgres, right?
Or in a vector database.
So that's to me, that's where
the question really sits.
I think most applications
need that, those connections.
It's just more about at what point
do you need the graph database?
And honestly, I think it's
pretty rare, unfortunately for
the graph database providers.
Nicolay Gerold: And also I think like
when you can avoid it, you should.
Because when you start to have a separate
database for your knowledge graph as
well, you're already dealing with a
distributed system, which is something
you should avoid at all costs, if you can.
Robert Caulk: I don't know.
I'm curious what you think.
So does that mean you're against the whole single-responsibility principle of development, or?
Nicolay Gerold: I believe the whole industry has over-indexed on distributed systems, because it is the de facto standard for most things nowadays.
But I think machines have gotten so powerful that you can do so much stuff on single machines, and you can be creative in how you use them, that you often don't need the single-responsibility paradigm.
And in a sense, having one database which handles all my queries, I would say that's also a little bit of single responsibility.
Robert Caulk: Yeah.
It would be nice, but a lot of problems come into play.
I don't know.
For us, we use a lot of UUIDs, which would violate what you're talking about, where we do have multiple databases and S3 holding a lot of information, and we're tracking UUIDs or IDs, the IDs that you can actually back out if you just find them somewhere in the world.
At least if you're storing a lot of information, I'm struggling to see how you could store it all in one database.
At some point you do have to start distributing stuff.
I see a lot of people doing even more storage than us, terabytes of storage that they need indexed and available in a database.
That's millions and millions of vectors, hundreds of millions of vectors.
That's a lot to store in one location.
But I'm not sure, it's tough.
I'm watching it closely to see how it evolves.
We're a bit closer to the micro-deployment world, or microservice world, where
everything has its place, everything is
individually deployable, and that allows
us to build what we call the largest
news knowledge graph on the planet.
Maybe for most applications, it's completely unnecessary.
Nicolay Gerold: Yeah, I think when you go distributed, you should have a clear reason for it.
And I think a lot of people don't, and a lot of people start with complexity instead of starting with a simple system and then evolving it over time, when they need the performance or the extra reliability of a distributed system.
And I think that's often because a lot of the tutorials, a lot of the resources out there, are from big tech or from people who worked at big tech.
But their needs are very different
from what most organizations need.
Robert Caulk: That question, I think, is what makes or breaks startups.
I see this on LinkedIn and Twitter all the time, this whole idea of: don't over-engineer too early, hack it out, get something out, and see if it works.
And you'll re-engineer it again.
Once you have a full crew, you'll probably break it down and start from scratch.
So starting with a microservice-oriented paradigm might end up giving you baggage because you started too early.
On the same note, if you have conviction in your application and you're building it out, you're choosing tech debt from the start, right?
When you hack something together, you're choosing tech debt, because at some point you're going to need to pay that debt.
That could slow you down at a very pivotal moment in the development of your startup, where it's like: oh, we just blew up.
We have 10,000 users per hour ready to be served and they're trying, but this system is just completely falling over.
We're going to need to block and pull back to the 200 per hour that we were able to serve before.
And then in six months, let's just hope and pray that those 10,000 per hour come back.
I don't know.
That sort of decision making is
so abstract and so difficult.
It's project manager level stuff.
Even I consider myself not capable of this level of direction.
I usually move more towards: let's engineer it right.
Let's build the tests.
Let's think about the future.
Let's think about scale.
And that has paid off for us, but we definitely don't move as quickly as if we had just straight up hacked everything together in a big monorepo, clicked play, and hoped for the best.
I don't know.
I think there's a lot to learn in the industry.
Nicolay Gerold: I think as a
startup, you have to pick your places
where you want to over engineer.
What is the thing that makes you special?
And I think in that you can over
engineer, but in the rest you shouldn't.
And also: where do you have experience?
The tech stack you have built with before is likely also the tech stack you should continue building with, because when you're building a startup, you don't want to learn at the same time how Kafka, for example, works while you're building your entire system, when you should be focusing on the business problem you're solving.
So you should pick and
choose your battles.
And I think also there are so many really good managed services out there at the moment, which you should probably pick as a startup, because when you hit scale, it will cost you, but you can still serve it.
Robert Caulk: That's true.
Yeah, choosing the managed service route, that's also something we struggle with.
We love self-hosting.
And whenever we do choose to take a managed service, it took a lot of bending over backwards, fighting our instincts, to say: listen, this is not something we should be worried about right now.
Let's stop spending 10 hours of engineering per month dealing with this and move to the managed service.
This is one reason I think AI
is not taking over anytime soon.
Some of these decisions that we're talking about here are so abstract and so broad and wide, and require so much context beyond what can be found on the internet, that the whole idea of the AI CEO, to me, is a complete moonshot, and it's not going to happen for a long time.
I'm curious if you agree or not.
But this conversation is evidence of it, I think.
Nicolay Gerold: Yeah.
For me, take tech stack decisions alone.
There are so many tools out there that I wouldn't trust an AI to do it, because based on what I'm building, where I'm building it, what I'm building it for, and also what I have in my team in terms of skill set, it all steers me into certain directions.
And I think all LLMs have certain biases, which in my opinion are being trained in.
Gemini probably will favor Google Cloud.
Robert Caulk: That makes sense.
Nicolay Gerold: OpenAI will favor Azure.
I think you will encounter more and more biases when it comes to that.
They will recommend their own stuff.
And when you're building a startup, or when you're building anything in tech, yes, it is the easy path to just use it.
But whether it's the right one, I wouldn't say so.
Robert Caulk: Yeah.
Yeah, I agree.
Nicolay Gerold: Where are the places where
you actually went with managed services?
I'm really curious. Qdrant is one we already talked about.
Robert Caulk: We self-host Qdrant because it is big; we have our own cluster, and so that means that we do a lot of hosting, but we also have a cloud cluster.
We use DigitalOcean managed Kubernetes, which is really convenient, and they have a great S3-compatible storage that we use.
We have a couple databases.
GitLab is a good one.
Dealing with runners and stuff, we've done it for a long time, and we even still do run GitLab runners.
But at some point, you might as well just outsource that sort of stuff.
So yeah, I think those
are some of the key spots.
I'm trying to think of some others here.
We do a lot of both.
We self-host LLMs, so we self-host Llama, we self-host Phi-3.
But at the same time, we're also outsourcing to Anthropic and OpenAI and even Fireworks, to help offload extra bandwidth in certain cases and to get higher quality in other cases.
So those are some of the services that we lean into.
Nicolay Gerold: And because we
already touched upon it, I'm really curious about the knowledge graph and Qdrant, the architecture behind it.
Like, how are these two components
connected and integrated?
And especially when I'm thinking
through like a query of a user,
like how does it flow through them?
Where actually does the knowledge
graph begin and where does it stop?
Yeah,
Robert Caulk: There is a hierarchy of services that are talking to each other.
Qdrant is handling the entry point, right?
Qdrant is this really beautiful system where it's able to get you that general regional overview of what you're after.
looking at all of the news.
And so we're, to start, we need
to at least get a general region,
because there's no way we're going
to host and upsert a million entities
per day into a graph database.
Instead, you can start with
a vector database, right?
And that vector database is really
powerful at searching through hundreds
of millions of vectors and then filtering
those vectors on nodes, basically your node types.
At the end of the day, a graph,
a knowledge graph is simply
metadata with node names and then
the relationships between them.
There's nothing stopping you from using a Postgres or a Qdrant to just store metadata, or even NoSQL, MongoDB.
Just filter on and index the parts of your graph that you want to gain early access to, before the graph is even constructed.
So this is basically the idea: the entry point of the query is, hey, let's at least narrow it down before we get into the nitty-gritty of multi-hop querying and applying Cypher in some kind of sophisticated way.
So at least that gets you access
to a lot of the information.
A lot of this is just UUIDs, like I talked about earlier.
And so some of the data is not necessarily stored directly in Qdrant.
And so that also makes things a
lot faster and more efficient.
Because again, you don't necessarily need
all of that information sitting in RAM.
You just need the indices sitting in RAM.
Actually, with Qdrant, those indices can sit on disk, which
is really powerful because it opens
up the ability to make massive indices
without paying an arm and a leg.
So some of those indices are absolutely
helping build out the final graph.
First we're getting a, we're aggregating
a bunch of information using Quadrant
S3 and then able to then build a graph
on the fly from that information.
And so that graph, when you're building
it, that's the more complex part, the
part that we've put our research into,
how to build that graph on the fly in
a very A, efficient way, B, accurate
way and and then once you have that,
you can upload it into a GraphDB
because now you have this small graph.
It's really agile.
It's really mobile.
You can upload that
into whatever you want.
You can put it into memory with Kuzu.
I don't know if you've heard of Kuzu.
Or you can put it into Memgraph, which is really powerful, or Neo4j, whatever you want.
Because now you just
have your triples, right?
We've gotten the triples, but in order
to get to that subset of triples,
we had to take this kind of journey.
And I don't see any other way that you would go about doing something like this, because again, you can't just upsert millions of entities a day into a graph DB.
They're very heavy,
computationally expensive.
They're not designed
for this sort of thing.
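A sketch of that "small graph on the fly" idea: triples extracted from documents retrieved via Qdrant are loaded into an embedded Kuzu database and then queried with Cypher, including the two-hop traversal mentioned later. The table names, triples, and query are illustrative assumptions, not the AskNews pipeline.

```python
# Sketch: load a narrowed-down set of triples into an embedded Kuzu DB,
# then run Cypher over them. Table names, triples, and the query are
# illustrative assumptions.
import kuzu

db = kuzu.Database("./onthefly_db")
conn = kuzu.Connection(db)

conn.execute("CREATE NODE TABLE Entity(name STRING, PRIMARY KEY (name))")
conn.execute("CREATE REL TABLE RELATED(FROM Entity TO Entity, relation STRING)")

# Triples that would normally come out of the retrieved documents.
triples = [
    ("Putin", "meets", "Kim Jong Un"),
    ("Kim Jong Un", "sends troops to", "Ukraine front"),
]

for subj, rel, obj in triples:
    for name in (subj, obj):
        conn.execute("MERGE (e:Entity {name: $name})", parameters={"name": name})
    conn.execute(
        "MATCH (a:Entity {name: $s}), (b:Entity {name: $o}) "
        "CREATE (a)-[:RELATED {relation: $r}]->(b)",
        parameters={"s": subj, "o": obj, "r": rel},
    )

# "Everything within two hops of entity X" -- the kind of query a plain
# vector database or SQL table cannot express directly.
result = conn.execute(
    "MATCH (a:Entity {name: 'Putin'})-[:RELATED*1..2]->(b:Entity) "
    "RETURN DISTINCT b.name"
)
while result.has_next():
    print(result.get_next())
```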
Nicolay Gerold: I would love to
make it a little bit more tangible.
So I'm a user, and I have the question: is Kamala Harris going to win this election?
We know it by now.
But the query comes in.
I'm assuming you're searching
over some form of document.
Robert Caulk: Yeah, then we essentially run that query against the vector database, against Qdrant, in this example.
So we say, okay, I want to know about Kamala Harris for the presidential election.
Okay, query Qdrant, get all of the documents related to that.
Now, there are multiple
ways to query AskNews.
If you want to just use the
natural language query, you can.
If you did want to attach other
metadata filters, you absolutely could.
And in that case, let's say I wanted only things about Kamala in the presidential race that mention Putin.
Let's go with that, right?
Because that would be your start for building a graph about Putin and Kamala, because let's say you want to get all relationships between Putin and anything Kamala-related.
The first step there would be to actually fetch all the documents that mention Putin with respect to Kamala.
So we're able to get those documents first and then build out a graph related to that.
Does it make sense?
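A sketch of that kind of narrowing query against Qdrant, combining the semantic query vector with a metadata filter on extracted entities; the collection name, payload fields, and placeholder embedding are assumptions, not the actual AskNews schema.

```python
# Sketch: semantic search narrowed by a metadata filter on extracted
# entities ("about Kamala in the presidential race, mentioning Putin").
# Collection name, payload fields, and the embed placeholder are assumptions.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient("localhost", port=6333)
embed = lambda text: [0.0] * 768  # placeholder; swap in your real embedding model

hits = client.search(
    collection_name="news_articles",
    query_vector=embed("Kamala Harris presidential election"),
    query_filter=Filter(
        must=[
            FieldCondition(key="entities", match=MatchValue(value="Vladimir Putin")),
            FieldCondition(key="entities", match=MatchValue(value="Kamala Harris")),
        ]
    ),
    limit=100,
)
documents = [hit.payload["summary"] for hit in hits]  # starting set for the on-the-fly graph
```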
Nicolay Gerold: Yeah, so
basically you have embedded,
for example, all the news items.
I'm running my search over it.
I'm getting the documents and
then I'm constructing a graph.
But how does the graph
actually now come in?
What value does it give to me?
Why am I not just returning the top hundred documents and saying I'm done with it?
Robert Caulk: It depends on your context engineering and what you're after, and it depends on if you even need a graph.
The graph can come in handy in certain circumstances.
Let's say I got a thousand articles in this region, and now I do want to start doing complex Cypher queries.
Okay, now I have my triples.
I create my on-the-fly DB, and now I can actually query it with Cypher and say: okay, I want all of these types of relationships within two hops of entity X.
That's a sophisticated query that you cannot do with a traditional Postgres or Qdrant or something.
So if you need to make that type of query for your extraction of information, then you would need that.
But if you don't want a graph, we don't need to build you a graph, right?
Essentially, we have a pipeline of enrichment, and you can dip in at whatever level of enrichment you want.
If you just want the raw documents, you can treat us like a typical news API where you just grab news docs out and you're on your way.
Great.
No problem.
If you want us to construct
your graph for you, no problem.
If you want us to construct your graph
and query your graph for you, no problem.
So essentially that enrichment pipeline allows people to select what they need and when they need it.
Nicolay Gerold: Yep.
What would be a good example for
a query where you actually have to
construct the graph, query the graph,
but then I'm also really curious about the re-ranking part.
Like, how do I then combine the different
signals from the graph, from the semantic
search, and probably also the keyword
search, which you're running as well.
Robert Caulk: A case for our graphs, for our news knowledge graph, would be an analyst trying to explore relationships around some topic of interest.
Here's an example: Love Justice is an NGO in South Africa; they're trying to combat human trafficking.
One of the ways to do that is to identify entities which have cropped up in relation to crimes that may be related to this topic.
So they can build a graph out, go select that entity, and then get two hops out from that entity.
And all of a sudden, now you've got
semantically different information
that's related to it, which allows you
to identify a hidden relationship, right?
Because let's say a crime occurred at this shop, and so they managed to uncover some crime at this shop that's in the news.
There's another article somewhere that mentions that this shop was struggling to get supplies from this organization, and then that organization has some crime associated with it.
And all of a sudden you've managed to potentially connect dots that were not immediately apparent by just doing a similarity search on that initial shop.
So now you've basically identified that, hey, this is interesting.
This organization had a crime that may have a connection to this other crime that occurred at this shop.
And I could only get to that through this knowledge graph.
Does it make sense?
Nicolay Gerold: Yeah, and I think
you could probably do it both ways.
Use it to enrich, but also use
it to basically narrow it down.
because the relationships that you're actually looking for aren't there.
What I'm really curious about, we talked
about in conversation before that you
basically normalize everything to English.
What I thought about immediately is: how do you handle entities or relationships or concepts that do not translate well?
Robert Caulk: That's a good question.
So we're, we are tracking, I think, 17,
maybe 19 languages now, and we only track
languages that we know can translate well.
Now, that still doesn't technically answer your question, because, say, the majority of Russian translates well, but there are certain entities that just don't.
They don't translate.
And the beauty is, if it doesn't translate, a lot of times it just won't translate the entity name, right?
And so we can still have that
original name in the database.
And especially if it doesn't
translate well, it's probably just
going to keep that original name.
There's no other way.
There's nothing else it can do with it if it can't translate it.
So we can still hang on to
entity names that are in other
languages, which is great.
But when it comes to text and stuff
and expression of information, there's
absolutely some information lost.
And that's unfortunately kind of
part of the game that you're playing.
If you want good, relevant search across French, Swedish, Russian, and Italian all at the same time, you want to say: give me all of the latest information from all of these languages related to European politics.
You could go in and try to do multilingual
embeddings and then you have multiple
vectors potentially, or you have
one embedding model that you think
is great for just those languages.
It's going to get messy fast.
But if you've normalized everybody to one language and then embedded, now you're operating in a much cleaner parameter space.
At the end of the day, we're just
talking about parameter spaces.
And that parameter space is, hey, my
Swedish is, my Swedish report on European
politics is now sitting in English.
My Italian report about European
politics is sitting in English.
And therefore, when I make an
English query, those things
are coming out equivalently.
We've now reduced bias, right?
So we've mitigated some bias
towards a language because we're
not actually biasing to a language.
Yes, in general, we're sitting
in an English translation
bias, which we fully admit.
Okay, but it's one of the better biases to be sitting in, since a lot of the LLM world is operating in English, and prompts are very receptive when you start interacting in English.
To answer your question directly, there are absolutely sacrifices, but in some cases you may be surprised: if it doesn't translate, we're just holding on to the original entity name.
For the rest, we get a nice, balanced retrieval.
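A sketch of that normalize-then-embed idea: translate every article to English before embedding so queries and documents share one parameter space. The LLM-based translation call and model names are assumptions, not AskNews's actual pipeline.

```python
# Sketch: normalize every article to English before embedding, so queries
# and documents share one parameter space. The translation call and model
# names are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def to_english(text: str, source_language: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Translate the following {source_language} news text to English, "
                f"keeping entity names as-is if they don't translate:\n\n{text}"
            ),
        }],
    )
    return resp.choices[0].message.content

def embed_article(text: str, source_language: str) -> list[float]:
    english = text if source_language == "en" else to_english(text, source_language)
    return client.embeddings.create(
        model="text-embedding-3-small", input=english
    ).data[0].embedding
```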
Nicolay Gerold: Yeah, and to really drive it home: I built something similar once and I tried to do it in multiple languages.
I spent nearly a month trying to get entity deduplication to work.
Because if you have multilingual entities and relationships, you have to know, okay, which entities and relationships are equivalent and which are not.
And when you're dealing in multiple
languages, it's significantly
harder than just in one when
you're already in English.
Robert Caulk: So disambiguation, which is actually slightly different than deduplication, was, and currently is, a research topic for our laboratory, because we are an applied research lab.
So we're doing research on these topics and injecting it into the tools.
And that's part of the offering: use our service and you benefit from all of our research.
One of our research topics, and I think it was at least a three or four month research stint for two researchers, was how to best disambiguate across multiple languages and across different topics, right?
It's not even just languages, it's topics.
And we found a really nice method which works very well the majority of the time, and so that is the default approach that you'll get when you interact with the knowledge graph.
But you still need human-in-the-loop interaction some of the time.
And it may not even be
because we were wrong.
It may be because you, as an analyst,
want these two things disambiguated.
Even though there's no reason beyond your
particular use case to disambiguate them.
So we need to give you as the user the
ability to then correct the disambiguation
or constrain the disambiguation.
So that's part of the tool: as an analyst, when you use our newsplunker dashboard, we'll build your graph for you right there.
You're visually looking at it, and okay, for my particular use case, I've got a Trump here and an Elon Musk and they're separate nodes.
But actually for my use case, I want
them to be the same node because I
just want all relationships that touch
either Trump or Elon Musk in this case.
And so I'll just disambiguate
them as the same thing, right?
So that would be perfect for your particular use case, but our system would still disambiguate automatically.
President Donald Trump, former President Donald Trump: all of that will turn into one node already based on our research, because that's about the extent to which we can drive that disambiguation.
But if you then want to pair Melania with it, or you want to create a Trump family node or whatever out of all of them, then we make that functionality available to you, because that's just what analysts asked for, right?
When we were building it, a lot of the tool came about just because analysts needed some functionality.
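A small sketch of what that could look like on the user side: an automatic alias map collapses obvious variants, and the analyst can layer their own merges on top. The alias map, merges, and function names are invented for illustration.

```python
# Sketch: collapse entity aliases to canonical nodes, then apply
# analyst-defined merges on top. The maps and names are invented.
AUTO_ALIASES = {
    "President Donald Trump": "Donald Trump",
    "former President Donald Trump": "Donald Trump",
}

def canonicalize(entity: str, user_merges: dict[str, str] | None = None) -> str:
    entity = AUTO_ALIASES.get(entity, entity)       # automatic disambiguation
    if user_merges:
        entity = user_merges.get(entity, entity)    # analyst overrides
    return entity

# An analyst who wants one combined node for their use case:
user_merges = {"Donald Trump": "Trump family", "Melania Trump": "Trump family"}

triples = [
    ("President Donald Trump", "attended", "rally"),
    ("Melania Trump", "visited", "charity gala"),
]
merged = [
    (canonicalize(s, user_merges), rel, canonicalize(o, user_merges))
    for s, rel, o in triples
]
print(merged)  # both subjects collapse onto the "Trump family" node
```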
Nicolay Gerold: Yeah, and a
lot of that comes also into
play with entity recognition.
And what I'm really curious about
is how do you handle new domains?
So just as an example: say you haven't handled any legal publications yet, and suddenly legal news is being integrated into your knowledge graph.
The entities it extracts, and also the meaning of the entities, are likely wrong or different from what you've already seen.
So how do you handle that?
Robert Caulk: Yeah.
I love this question.
So first of all, there's a tiered response to this.
And we've solved it; that's why I like the question so much.
First, let's start with the model that we trained for entity extraction and released on Hugging Face.
It's called GLiNER News.
We fine-tuned GLiNER, which is a very smart architecture based on BERT.
Now, this architecture has extrapolative abilities.
That doesn't mean that it's able to perfectly move into a new domain, right?
That it's never seen law, and now it's able to do law without seeing it.
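For reference, a minimal sketch of using a GLiNER-style model from the gliner package; the exact Hugging Face repo id and the label set here are assumptions on my part, so check the Emergent Methods page for the actual model name.

```python
# Sketch: zero-shot entity extraction with a GLiNER-style model.
# The repo id and label set are assumptions; check the Emergent Methods
# organization on Hugging Face for the actual model name.
from gliner import GLiNER

model = GLiNER.from_pretrained("EmergentMethods/gliner_medium_news-v2.1")

text = "Kamala Harris accepted the nomination after Joe Biden exited the race."
labels = ["person", "organization", "location", "political event"]

entities = model.predict_entities(text, labels, threshold=0.5)
for entity in entities:
    print(entity["text"], "=>", entity["label"])
```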
So you're right.
Your intuition hits the nail on the head.
But if you have trained it for a variety of topics, it is very good at extrapolating, and we've proved that with our research paper, I can link you to it.
It's called curating synthetic data with
grounded perspectives for equitable AI.
And the whole point is: how do you train an entity extraction model to be good at extracting across a variety of topics, a variety of countries, a variety of entity names, right?
Law, sports, Kenyan organizations, Chilean business names, and Brazilian politicians' names.
The only way to really prepare your model for new incoming entities is to make sure that it has seen a good representative sample of the world.
So this research that we did, which was probably more like six months of research, was about how you engineer that underlying dataset to be representative of the types of topics that you expect to encounter when you're actually deploying this entity extraction model.
And so essentially, we have very good
diversification techniques across
topic and continent and country and
even page rank and trying to really
dive deep into the diversification to
again, fill out the parameter space
that we're planning to operate in.
And that works really well, especially if you use the proper separation between your train, eval, and validation datasets.
If you don't have leakage, your eval and validation are actually going to give you some insight into how this thing is going to perform on entities it has never seen.
Part of what we split on between
these datasets is time because that's
where you'll see new entities arrive.
And luckily, since we've engineered the context well in the past, we can split on time wherever we want, and that enables very powerful training and very strong conviction that this model is going to be able to adapt to new domains.
Now, the second part of the answer beyond that is that we retrain it on new information.
So every six months, we're retraining just to make sure that the latest entities in the news are in the training data, because that makes it even better at entities it's seen before, and it makes it more prepared for similar entities that might be entering and new topics which are becoming more important.
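A tiny sketch of that time-based split, assuming each labeled example carries a publish date; the field names and cutoff dates are arbitrary.

```python
# Sketch: split training data on time so the eval/validation sets contain
# entities the model has never seen. Field names and cutoffs are arbitrary.
from datetime import date

def split_by_time(examples: list[dict], train_until: date, eval_until: date):
    train = [e for e in examples if e["published"] < train_until]
    evals = [e for e in examples if train_until <= e["published"] < eval_until]
    valid = [e for e in examples if e["published"] >= eval_until]
    return train, evals, valid

examples = [
    {"published": date(2024, 3, 1), "text": "...", "entities": ["..."]},
    {"published": date(2024, 9, 1), "text": "...", "entities": ["..."]},
    {"published": date(2025, 1, 15), "text": "...", "entities": ["..."]},
]
train, evals, valid = split_by_time(examples, date(2024, 7, 1), date(2024, 12, 1))
```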
For example, when we trained the initial GLiNER, we certainly did topic diversification, but there's no doubt that there was a large percentage of data related to the Ukrainian war, Israel, Gaza, right?
These are topics that were present, and that means we're going to be good at extracting entities related to those topics.
But as new topics, hopefully not new wars, but new general trends, ideas, and subjects become more prominent, we adapt.
So that's the second layer of it.
We retrain and redeploy.
And so now when you're using our system,
you don't have to worry about any of that.
You get all the latest entity extraction.
You can rely on the quality of what we've
built because we're using our own model.
The final point I'll make is that this model was a top-10 downloaded model in 2024.
I think we eclipsed 8.5 million downloads.
And that's not because
people care about diversity.
It's because it works, right?
That's it.
The way we approached the
training of this model means
that it actually works very well.
And there's a lot of need for
entity extraction, and news
happens to be a really diverse
representation of information.
We've got science about biology, science about mathematics, right?
People report on all topics.
And so you have a good base vocabulary there.
So it's called GLiNER News, but it turns out it's really usable across almost all domains, just because of the nature of the topic diversity and how we've injected it with the news.
Nicolay Gerold: If you have the time, I have two follow-up questions on that.
Robert Caulk: Go for it.
Nicolay Gerold: Maybe first on the continual training: how do you actually approach the training mix?
Because there have been a bunch of studies on that, like what's the optimal mixture, and I think it's still pretty unknown.
I think Predibase, for example, advocates for basically 50 percent from the old training set and 50 percent new data, to not bias it too heavily.
Do you have any tricks?
Robert Caulk: We have some configs that we've learned work well.
A lot of what we do is human-in-the-loop looking at data, right?
This is one reason things work for us.
Yeah, we use evals.
We even track general benchmarking datasets while we're training.
But at the end of the day, sometimes it can do well in training.
But if you go look with your own
two eyes at a lot of the output,
you may not actually be as impressed
or happy about what's happening.
So a lot of what we do is we
look with our own two eyes.
And I, when I say we, I should
say Ellen, who's our Director
of Transparency Research.
This is transparency research
at its finest, looking at
it with our own two eyes.
And that means, yeah, okay, taking the data, looking at statistics about that data, visualizing various dimensions of how it's represented, looking at where the error is, looking at the error itself, identifying where the error is, and deciding: is this error that we like or don't like?
Why is it an error?
Sometimes it just crops up as error in your training.
But the reason for the error may
actually not be as bad as you think,
or it may be worse than you think.
And so you need a human
to just go in and look.
And so at the end of all of that,
we end up with a config and we say,
Hey, this is the split that worked
at the end after all of our retries.
This is the heuristic we're using for
our splits across diversification.
I'll give you an example of kind of an arbitrary decision that just had to be made.
We just don't have the same amount of data coming out of Kenya as we do out of the UK and France and the USA, but we still need that data.
So how do we decide the amounts?
We can't cut the United States all the way down to the same amount of data as from Kenya; we'd lose a ton of data, but we also don't want to bias too much toward the United States.
And so this comes down to a bit of: hey, we think generally we're going to cut off these major countries at this percentage and make sure that the tail is smooth, and that, yeah, okay, Bangladesh has a very small representation, unfortunately, but at least we're guaranteeing that the distribution fits what we would expect the general representation to be from a reasonable human perspective.
And then it ends up working when we apply it to Kenya, right?
Checking: okay, Kenya didn't have a great representation, but when we apply the model to Kenyan information that it hasn't seen, how well does it do?
A lot of the decisions are kind of art
in some ways, which I hate to say, but
that's a moat, right?
We couldn't find real, hard,
quantitative solutions to some of
the questions that you're posing.
And the fact that you're posing those
questions indicates that you're locked
in. We should probably
hire you to start working for us at
some point, because those are the
hard questions to answer, honestly.
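A minimal sketch of that kind of capping heuristic, purely illustrative (the cap fraction, the country field, and the held-out check are assumptions, not Emergent Methods' actual config):

    import random
    from collections import defaultdict

    def cap_by_country(samples, cap_fraction=0.15, seed=42):
        """Cap any single country's share of the training mix, keep the long tail intact."""
        random.seed(seed)
        by_country = defaultdict(list)
        for sample in samples:
            by_country[sample["country"]].append(sample)

        max_per_country = int(cap_fraction * len(samples))
        capped = []
        for country, items in by_country.items():
            if len(items) > max_per_country:
                items = random.sample(items, max_per_country)  # downsample majors (US, UK, FR)
            capped.extend(items)  # Kenya, Bangladesh, etc. stay untouched
        return capped

    # Afterwards: evaluate on held-out items from the under-represented countries
    # to check the model still generalizes there.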
Nicolay Gerold: Maybe another hard one.
How do you think about like slop?
So basically the AI generated content
that probably doesn't carry much signal.
It doesn't carry any diversity
and also isn't really fresh.
Robert Caulk: Yep.
We deal with this a lot, because a
lot of what we're doing is scraping
open web information, and some of that
open web information is indeed slop.
It's just total AI generated crap.
I don't know.
There's a bit of a philosophical
debate here about if it's going
to break the Internet, right?
I think that's what
everyone wants to know.
Are we going to just deteriorate into
this idiocracy of AI generated madness?
And I don't subscribe to that mentality,
only because people are directed towards
information that they buy into, not buy
like physically, but that they
agree with, that they resonate
with, what's being described there.
And I don't see people
resonating with AI slop, ever.
And so that, that content doesn't generate
value anywhere, and therefore it doesn't
warrant its existence on a server.
And really, it costs money to hold a piece
of information on a server on the web.
And if it's not generating value, it's
not going to sit there very long, unless
it's getting traffic and generating value.
So I think a lot of that slop
is just always moving, working
its way out of the system.
Unless someone really wants to waste
money hosting a ton of slop, I'm
seeing still a lot of great writing
coming out of a lot of places.
And even we, as a company that uses AI,
we hire human writers, because you need a
human if you're trying to generate a piece
of writing that hasn't been generated
before. You can't use an interpolation
machine to get to that answer. You can
use an interpolation machine to help get
you to that output, but you need a human
to connect that dot between what I as a
company or an organization want to express
and what I as a person believe the
audience is ready to read. And I don't see
slop connecting those dots anytime soon.
And that, to me, gives me comfort
that, okay, I'm paying writers,
so I'm still going to be
contributing to normal writing.
And we are an AI-first company, until we
see some true slop overload.
I'm at least optimistic, but what are
your thoughts on that point?
Nicolay Gerold: I think there will be
more and more content, and I
think it's the same as it was before.
Our recommendation systems are very good.
If you decide to engage with slop,
you will probably get more slop.
And you create your own echo chamber,
and I think you have to curate, and the
value of curation probably will go up.
But it already is a lot: the amount of
content that's already created means
a factor of 10 more probably won't make
much of a difference for the singular person.
For me, like when I'm writing stuff,
my quality bar is basically:
I take all the context I have, throw
it all into Claude, Gemini, whatever,
and give it really detailed instructions
on how to write it, what to write.
And when it gets close to what
I've written, I just throw
it away and don't publish it.
Robert Caulk: when it gets close?
Oh,
Nicolay Gerold: So,
I have written my blog post,
I'm finished.
I don't really write with AI, maybe for
some phrases and rewriting.
And then I feed all the input into
the LLM, and when it gets close to
my writing, I just throw it away
because I assume it's nothing new.
Robert Caulk: Interesting.
Okay.
Because it's only interpolated.
So you're on my side of the fence:
if an AI was able to interpolate to it,
it's not interesting enough for the web.
Interesting.
Okay.
I like that.
I like this.
I'm going to share that anecdote,
because I think that's a really good
test, and I had not thought
about using that as a test.
Nicolay Gerold: It's an
interesting quality bar, I think.
Robert Caulk: that's cool, interesting,
Nicolay Gerold: Nice.
So, what do you think
is missing from the AI and data space?
What would make your life
way easier if it existed?
Robert Caulk: As a consumer, myself,
Nicolay Gerold: Consumer,
developer, business person
Robert Caulk: I'm still not seeing it, or
maybe I'm just not seeing it, but
interaction with repositories is
still not at the level that I think
I would like or need in order to
expedite my process beyond Copilot.
I think Copilot is great.
It gets the job done for a lot of things.
It expedites the workflow.
But a much deeper understanding
of the repository is something I would
love to see. Something where I'm able to
onboard an engineer very rapidly into
a very sophisticated ecosystem of
repositories, where they can talk to them
and ask questions and get real answers.
For a lot of questions, when
you're talking to a repository, I've
never gotten great answers out.
This whole idea of talk to your
GitHub, it's always something
that kind of satisfies what you're after,
but it's not really
getting to what I need, and I end up
searching myself and then solving it.
I would love to see an engineer be
able to onboard and actually talk
to a set of repos and get that out.
Maybe it's happening with
a product that I've missed.
I think AI is moving quickly, so
I don't know. If you know of one,
I'm happy to get recommendations.
Nicolay Gerold: I don't. Two
thoughts on this, like how I do it.
I go through the commit history
and the issues and look at
how the project has evolved.
Because often the activity concentrates
on certain areas, and those are
the most important ones.
So you should probably
focus your attention there.
The second part is, I think we need coding
tools that behave more like systems.
We have a lot of
different components already.
We have good code search.
We have tools that can answer
questions decently well. We
have tools to construct context.
But no one, in my opinion, has
put it together in a good way.
And I think you have to construct it as a
system, because I think this also evolves
into how an agent would probably use it.
Based on a query or a task I want
to do, first I should decide: okay,
what do I actually need to do here?
Do I need to search, or am I just writing
boilerplate code where I can just use
the LLM and generate it immediately?
When I'm asking conceptual questions,
you probably want to pull in additional
context from the commit history, from the
issues, from discussions in the forum.
And I think you basically have to have
different components, like a microservices
architecture, probably, in the end.
And you're pulling them in on demand.
And I think there is no tool
out there yet which does that.
And I think in the beginning
it will look like a workflow.
You probably have a classifier, you have
four or five different workflows based
on the complexity of the task or the
type of question, and they are executed.
And over time, it can become more and
more agentic, but I think that's really
missing from coding tools at the moment.
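As a rough sketch of that classifier-plus-workflows idea (all the component functions here are hypothetical stand-ins for code search, a commit-history index, an issue tracker, and so on):

    # Hypothetical component stubs; a real system would call code search,
    # a git-history index, an issue/forum index, etc.
    def search_code(q): return [f"code hits for: {q}"]
    def search_commits(q): return [f"commit messages about: {q}"]
    def search_issues(q): return [f"issues and discussions about: {q}"]

    def classify_task(query: str) -> str:
        """Toy router; in practice an LLM call or a small trained classifier."""
        q = query.lower()
        if any(w in q for w in ("why", "design", "architecture", "decision")):
            return "conceptual"
        if any(w in q for w in ("bug", "error", "traceback", "fails")):
            return "debugging"
        return "boilerplate"

    def build_context(query: str) -> list[str]:
        """Pull components in on demand, like a microservices architecture."""
        route = classify_task(query)
        context = search_code(query)
        if route == "conceptual":
            context += search_commits(query) + search_issues(query)
        return context

    print(build_context("Why did we switch the retriever to hybrid search?"))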
Robert Caulk: I like that idea of
using commit history as context.
And I can see how going and finding the
relevant commit messages themselves,
appending that, and saying,
hey, here are the commit
messages, here are the commits,
I can see how that can help.
We, maybe we need proactive
context engineering.
So what about this?
What if, basically, as
you make your commit, you type your
commit message, but at the same time,
as you submit it to the repository,
an LLM ingests it and then builds out,
fleshes out, a much better description
of what occurred in that commit?
And maybe even, I don't know, it does
some other indexing across different
metadata points of this repository and
this branch, those sorts of things.
So then it has proactively
engineered the context.
So then in the future, when I want
to talk to the repo, I can now
talk to a lot more sophisticated text
without even getting to the code.
I don't know.
Sounds like something we should build.
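A rough sketch of what that could look like as a commit hook: read the staged diff, ask an LLM to flesh out the message, and store the result for later indexing. The model choice and prompt are assumptions, not a finished design:

    import subprocess
    from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM client works

    def enrich_commit_message(short_message: str) -> str:
        # The staged diff is the raw material the LLM fleshes out.
        diff = subprocess.run(["git", "diff", "--cached"],
                              capture_output=True, text=True).stdout
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model choice
            messages=[
                {"role": "system",
                 "content": "Expand this commit message into a detailed description "
                            "of what changed and, where stated, why."},
                {"role": "user",
                 "content": f"Message: {short_message}\n\nDiff:\n{diff[:20000]}"},
            ],
        )
        return short_message + "\n\n" + response.choices[0].message.content

    # Wired into a prepare-commit-msg hook, every commit gets an enriched description
    # that a "talk to the repo" tool can index later.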
Nicolay Gerold: I think there are
a few who do this automated
commit message construction.
I'm using Better Commits, the tool,
to write the commit messages.
And it basically forces you to
decide, okay, what scope is it?
Is it like feat, refactor, perf?
Then write a description, write a title.
I think the description
is the biggest part, because
that's the biggest annoyance.
Robert Caulk: I never write descriptions,
so that would be great to have.
Nicolay Gerold: The second part,
I think, is how can we actually make it
easier to find the right stuff.
And what I have gotten into is,
when I start a new file, the
first thing I'm writing is a docstring
at the top: what I actually want to do
and what the interface should look like.
And this is something that helps a lot.
I often don't even need to
pull in any additional stuff
into the context window, because
I've already written it out.
And I think commenting
more, not for other humans, but for
LLMs, is really not
a practice we should do for humans,
because it gets a little bit annoying,
but comments for LLMs, to actually
help them generate the right stuff, is
a really interesting habit to get into.
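To make the docstring-first habit concrete, a new file might start with nothing but intent and interface, so the model has the contract before any code exists (the example is invented for illustration):

    """Deduplicate near-identical news articles before indexing.

    Intent: wire-service articles get republished with minor edits; we only want
    one representative per cluster so downstream search stays clean.

    Interface:
        dedupe(articles: list[dict], threshold: float = 0.92) -> list[dict]
            - each article carries at least "title" and "body"
            - returns one article per near-duplicate cluster, keeping the longest body

    Decisions / already tried:
        - exact hashing missed paraphrased reposts, so we compare embeddings instead
    """

    def dedupe(articles: list[dict], threshold: float = 0.92) -> list[dict]:
        raise NotImplementedError  # the docstring above is the context the LLM completes from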
Robert Caulk: I wonder if some of the
tools go in and auto-comment repositories,
like a tool for talking to a repo.
I could imagine that, during
the initial indexing, one strategy
could be for the LLM
to go into each file, each function,
and itself make a little description
of what this thing does for future
LLMs to then come interact with
this more indexed repository.
Maybe that's one way to do it.
Nicolay Gerold: I'm not sure about
that, because what it does should be
obvious from the code I've written.
I think you should document
the decision making behind it.
What have I already tried?
Why am I choosing to do that?
The why and not really the what and how.
Because code is a way more precise
language than regular language.
What you express in your code
should already explain the what.
Robert Caulk: And the why is something
that AI will never be able to provide.
Which is interesting.
So yeah, maybe you're right: if
we start training engineers to be
a lot more careful about the why, in
the commit message, in the description,
and in the comments, the LLMs can
fill in the rest of the dots, right?
Like you said, everything else is
easily filled in, but the why, that's
the project manager that you're
trying to build into the system.
Why did we actually make this feature?
Then that context is a very
powerful extra dimension that I
honestly had not thought of before
this conversation. For sure, if
the LLM has the why, it's able to reason
on top of it and therefore probably
make pretty strong suggestions about
new features and potential bug fixes,
because it knows the thought process,
the external context, of why a lot
of this information came into being.
It's basically root context engineering:
making sure that your context has a root
in the most important part, the
human root of the context.
I don't know, I haven't thought
about this, but I want to try to
find a way to get this into the news.
This idea of like, why
was this news written?
Not what is the news?
But why was it written?
That would be really nice to have.
Nicolay Gerold: What's
the motivation behind it?
Robert Caulk: Yeah, that
honestly will open up a lot.
And in some ways we can
build those sorts of tools.
And there are organizations
that focus on where money
comes from; money talks.
But those data points are hard to come by.
So context engineering is hard.
Let's put it that way.
Nicolay Gerold: Maybe in that vein:
what's next for you guys?
What are you building
that you can already tease?
Robert Caulk: We're about to publicly
release alerts, where you just
naturally describe what you want to track.
And then you say, hey, alert me if
a protest breaks out in Paris, and
send me a Slack message and a Discord
message and fire off an email.
And then in the background,
we start tracking the news to see
if a protest broke out in Paris.
And as soon as it does, Slack
messages and emails go out. We call
it natural language alerts, and
we're really excited about it.
So that's the big teaser.
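Purely as a hypothetical illustration of the shape of such a feature (this is not AskNews's actual API), a natural-language alert can be thought of as a config plus a matching loop:

    alert = {
        "query": "a protest breaks out in Paris",
        "channels": ["slack", "discord", "email"],
    }

    def check_alert(alert: dict, new_articles: list[str], is_match) -> None:
        """is_match(query, article) -> bool would be retrieval plus an LLM yes/no check."""
        for article in new_articles:
            if is_match(alert["query"], article):
                for channel in alert["channels"]:
                    # stand-in for the real Slack/Discord/email integrations
                    print(f"[{channel}] ALERT: {alert['query']} :: {article[:80]}")
                return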
Nicolay Gerold: Yeah, at the
moment, a little bit of a pet project
of mine is basically an event stream
where I have a vector search running
on top, plus rule-based
classifiers and stuff like that.
I'm trying to pull it out, but it's
a really interesting topic: basically
Kafka, but for unstructured data,
like stream analytics, but it's
Robert Caulk: But what's the data source?
Where's the data coming from?
Nicolay Gerold: I'm just
mocking it up at the moment.
So I'm just using
different RSS feeds of news channels.
It's mostly technical news, technical
content, and trying to get notified
when something pops up
that is of interest to me.
And I think it could be used
in a lot of different areas and ways.
But the technical aspect behind it is
really challenging, because stream analytics
operates mostly with really simple queries,
which can be expressed in pretty basic
SQL, and doing it on unstructured
data, and especially unstructured data
that is very large, like news items and blog
posts, is a little bit more of a hurdle.
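A minimal sketch of that kind of stream, assuming feedparser and sentence-transformers; the feed URL, interests, and threshold are placeholders:

    import feedparser
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    interests = ["new vector database releases", "LLM inference optimization"]
    interest_vecs = model.encode(interests, normalize_embeddings=True)

    def poll(feed_url: str, threshold: float = 0.45) -> list[str]:
        entries = feedparser.parse(feed_url).entries
        texts = [e.get("title", "") + " " + e.get("summary", "") for e in entries]
        if not texts:
            return []
        vecs = model.encode(texts, normalize_embeddings=True)
        scores = vecs @ interest_vecs.T  # cosine similarity, since vectors are normalized
        return [entries[i].get("title", "") for i in range(len(entries))
                if scores[i].max() >= threshold]

    # hits = poll("https://example.com/feed.xml")  # placeholder feed URL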
Robert Caulk: I should let you hook into
the system to see if it helps. But yeah,
a big aspect is, how do
you get to the content? Basically cost:
how do you reduce the cost of
tracking the news with an LLM?
That's more or less what
it all boils down to.
Because you could feed everything
into an LLM and just keep asking
it the question: has this occurred?
Has this occurred?
But if you do the right context engineering
with the right retrieval and you've
processed the information properly,
then it starts to become a lot
cheaper, and that kind of goes
back to that processing pipeline.
So I'm happy to geek out in a future
session about how to do event tracking.
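A sketch of that cost idea: put a cheap similarity filter in front of the expensive "has this occurred?" call, so only a handful of candidates ever reach the LLM (embed and ask_llm are stand-ins for whatever embedding model and LLM client you use):

    import numpy as np

    def has_event_occurred(event_query, articles, embed, ask_llm, sim_threshold=0.5):
        """embed(list[str]) -> unit vectors; ask_llm(question, context) -> bool."""
        q_vec = embed([event_query])[0]
        a_vecs = embed(articles)
        candidates = [a for a, v in zip(articles, a_vecs)
                      if float(np.dot(q_vec, v)) >= sim_threshold]  # cheap stage
        # Only the survivors hit the expensive stage.
        return any(ask_llm(f"Does this article report that {event_query}?", a)
                   for a in candidates)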
Nicolay Gerold: Yeah, I would love to,
once I get it working completely.
Nice.
And if people want to follow
you, check out what you guys are
building or use it or follow your
research, where can they do that?
Robert Caulk: Yeah.
So the company is Emergent Methods, and
we have a website, emergentmethods.ai.
That's where our research is located.
The data source is called AskNews,
the context engineering
as a service; that's asknews.app.
So head over there.
We have our own editorial,
we have an editor in house.
So we believe that the content that we
produce is the least biased in the world.
And it's a nice global news tracking.
Check that out.
But we're also on LinkedIn, where
you can find AskNews, and on
Twitter and stuff like that.
But thanks for having me.
I appreciate the chat and
the questions were a lot more
sophisticated than I expected.
So I appreciate that.
Nicolay Gerold: So what can we
take away if we want to apply this?
I think, in general, building
an LLM project is first about
getting the basics right and then
iterating until it's good enough.
So the first step should always be:
define your problem, and you should
spend a decent amount of time there.
What is the problem you're
actually trying to solve?
And then go into the data.
Examine and label the data,
and manually answer it.
So manually go through a few inputs
and actually try to derive the correct
answer, and not even just
a solid one, but a perfect answer.
And if you can't do this yourself,
get an expert to help you, and try to
also understand his thought process.
What are the actual
steps he's going through?
What are the specific items he's
looking at? Really make him explain
his thinking and note it down, and
also try to uncover what
additional context he is using when he
is creating his answer: what's the
context that's not on the page, not in
the input, that he has but you or the
model don't have. This allows you
to really understand the
data quality you're working with and the
gaps in the input data, in the context,
that you actually have to fill, because if
he has the necessary context in his head
and the model doesn't, we need to fill it.
So we need to figure out a way to
give the same information the
human has to the model, and then
try to identify the key features,
the decision points, and also the
different steps he's walking through,
because if you can't solve it in one shot,
this will allow you to
break it up into several steps which
in sequence solve the overall task.
Then the third step is to clean
and engineer the context. Test it
out a little bit, create a basic prompt
which takes in the input data, but already
spend a little bit of time actually
thinking through the preprocessing
pipeline: how do I want to change
the appearance or the format of my data
so the model can make a good decision,
and also try to strip out some noise
and form it in a way that makes sense.
If you have chronological
information, do a timeline.
If you have just general
information, use Markdown.
If you want to compare and
contrast, use something like tables.
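A small sketch of shaping the same records differently depending on the task, with the field names invented for illustration:

    def as_timeline(events: list[dict]) -> str:
        """Chronological information becomes a timeline the model can follow step by step."""
        ordered = sorted(events, key=lambda e: e["date"])
        return "Timeline:\n" + "\n".join(f"- {e['date']}: {e['description']}" for e in ordered)

    def as_table(items: list[dict], columns: list[str]) -> str:
        """Compare-and-contrast becomes a markdown table for quick scanning."""
        header = "| " + " | ".join(columns) + " |"
        divider = "| " + " | ".join("---" for _ in columns) + " |"
        rows = ["| " + " | ".join(str(item.get(c, "")) for c in columns) + " |" for item in items]
        return "\n".join([header, divider] + rows)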
And then you can actually
move into the actual model
selection and fine-tuning phase.
And by fine-tuning, I don't mean
actually fine-tuning the weights,
but rather adjusting the prompt
so it can generate a good output.
So pick an LLM that fits your needs.
And often you have additional constraints
in here that actually determine a lot.
So, for example, if you have really high
concurrency, you're likely stuck with
OpenAI and can't really use Anthropic.
But for things like coding, for example,
3.5 Sonnet is just better.
And use your labeled examples
to actually test different versions of
the model. What I always do is start with
a bigger model; nowadays this
would be something like o3-mini.
I test it, and then I think
about: okay, if it is going really
well, I actually move down.
So I use the same prompt, the same
input, and use, for example, 4o mini,
because it's way cheaper.
And when I can get the task done
with a cheaper model, I will do that.
And this will basically already
allow you to test, experiment,
and fine tune the prompts.
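A sketch of the "start big, then move down" loop: the same prompt and labeled examples run against two models, bigger first, cheaper second (model names and the crude scoring check are placeholders):

    from openai import OpenAI  # assumes the OpenAI SDK; the loop works with any client

    client = OpenAI()

    def score(model: str, system_prompt: str, examples: list[dict]) -> float:
        """examples: [{"input": ..., "expected": ...}], labeled by hand earlier."""
        correct = 0
        for ex in examples:
            output = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": system_prompt},
                          {"role": "user", "content": ex["input"]}],
            ).choices[0].message.content
            correct += int(ex["expected"].lower() in output.lower())  # crude check; use a real eval
        return correct / len(examples)

    # for model in ["o3-mini", "gpt-4o-mini"]:  # bigger first, then the cheaper one
    #     print(model, score(model, SYSTEM_PROMPT, labeled_examples))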
What you will likely encounter is
that the model isn't able to solve
all your input examples one-shot.
So you will likely spend some
time adjusting the context and the
prompts, and here really try the
different context engineering techniques
we talked about: construct the context
in different ways, construct tables,
construct timelines, construct markdown,
add additional instructions like, okay,
think step by step, and detail the steps
that you got previously from your
conversation with the experts, and tweak it
until you find decent enough results.
If it really takes you a long time
to get it working in one prompt, you likely
have to decompose it or add a few
shot examples. If you use few shots, I
would always be a little bit careful
that you also have examples in your
test set that are not too similar to
your few shots, because then you can
actually be a little bit more sure
that it can extrapolate to what it will
see once you put it into production.
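A sketch of that check: flag test examples that are near-duplicates of the few-shot examples in the prompt (the embedding model and threshold are arbitrary choices):

    from sentence_transformers import SentenceTransformer

    def too_similar_to_few_shots(test_inputs, few_shot_inputs, threshold=0.9):
        model = SentenceTransformer("all-MiniLM-L6-v2")
        t = model.encode(test_inputs, normalize_embeddings=True)
        f = model.encode(few_shot_inputs, normalize_embeddings=True)
        sims = t @ f.T  # cosine similarity matrix: test examples x few-shot examples
        # Anything above the threshold basically restates a few-shot example.
        return [test_inputs[i] for i in range(len(test_inputs)) if sims[i].max() >= threshold]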
And after that, it's basically
just iteration and evaluation.
So test your model
continuously, make tweaks,
evaluate it.
When it performs better,
put it into production.
In production, monitor
the outputs, flag cases
which might be wrong,
which might be correct.
If you can't collect explicit user
feedback, try to figure out a way to make
it easy for the user to give
feedback, even if it's implicit.
So often, for example, for ChatGPT,
I think a really good signal is
if the user copies out the data.
So if you can figure out something like
that, I think that's already a pretty big
step, so that you can automatically evaluate
what are really good outputs
and what aren't good outputs.
Also, if you have a chat application,
or anything where it's
more generation-based: if the user
regenerates, this is already
a pretty bad sign, like the output
was garbage, and you can already
flag that for human review as well.
And through that, you basically collect
new input-output examples, which you
have already flagged as positive
or negative, which you can add either
to your test set or which you can use
to tweak the prompt and see whether
you can improve your overall system.
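A sketch of turning those implicit signals into labeled examples you can review later; the event names and the JSONL file are placeholders for whatever your app and storage look like:

    import json, time

    def log_feedback(path: str, query: str, output: str, event: str) -> None:
        """event: "copied" (implicit positive) or "regenerated" (implicit negative)."""
        record = {
            "ts": time.time(),
            "query": query,
            "output": output,
            "signal": "positive" if event == "copied" else "negative",
            "needs_review": event == "regenerated",  # flag for human review
        }
        with open(path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # log_feedback("feedback.jsonl", user_query, model_output, "copied")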
And lastly, plan for the deployment.
Think about how you will
actually serve the model
and what additional
things are necessary.
For example, if you're based
in the EU, you likely have to
set up something like Azure,
so you can use models that are
actually hosted in the EU as well, to
comply with all the regulations we have.
And also consider costs, response
times, and how to monitor for errors.
So if you have to have
low costs, or you have to have really
fast response times, you likely
cannot use the biggest models.
Or, like in the new o3-mini case, you have
to configure it to use less reasoning.
There are three different
levels of reasoning in o3-mini,
so you likely want to use the lowest.
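For o3-mini specifically, the reasoning level is set per request; at the time of writing the OpenAI SDK exposes it roughly like this (verify the exact parameter against the current docs):

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="o3-mini",
        reasoning_effort="low",  # "low" | "medium" | "high"; lower is cheaper and faster
        messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
    )
    print(response.choices[0].message.content)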
And also try to figure
out a feedback loop.
How can you actually get feedback from
your users without any effort on their
side, so you can improve the system?
This is often an
overlooked aspect,
and it's a cross-department effort.
You often need the designers,
front-end engineers, and product managers
involved to actually figure this out:
how can you actually set this up?
But this will probably be
the most important step,
because data advantages and feedback
loops are probably the biggest advantages
you will have in software in the future.
Then take the stuff we talked
about for context engineering:
curating input signals,
giving it structure, refining it,
balancing breadth and precision.
So basically treat the
context window as precious.
Every token counts, so don't
throw everything in there.
Actually spend some thought on what
should be in there and what shouldn't.
And also don't be afraid to take away.
I think the default knee-jerk
reaction most of us have
is, when a prompt isn't working,
add more instructions, add more
stuff. Sometimes just removing
irrelevant instructions or redundant
instructions already helps a lot.
And if you want to move into the
whole knowledge graph space:
I think a knowledge graph is really
complex if you want to get it to work.
So if you want to construct a knowledge
graph, I would first start with metadata.
Usually you already
have a vector search in place,
so I would first start with metadata
and start tagging the different items.
For example, I would use entity
extraction to identify the key elements
in the text: people, organizations,
events, concepts. But make it really
use-case specific, like what's relevant
to your specific use case, because
then often you don't even need a really
good entity extraction model;
you can get away with regex matching.
If you have a set of entities which
are really important, take those,
do a regex match, and put the
matches into an additional
metadata column called tags.
This you can use for
pre- or post-filtering.
And with this you can avoid the additional
complexity of using a knowledge graph
in the beginning, as you're building
out your general capabilities.
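A sketch of that tags-column idea: a fixed list of entities you actually care about, regex-matched into metadata you can pre- or post-filter on (the entity lists and schema are use-case-specific placeholders):

    import re

    ENTITIES = {
        "org": ["OpenAI", "Anthropic", "Qdrant"],
        "topic": ["vector search", "knowledge graph", "entity extraction"],
    }

    def tag(text: str) -> list[str]:
        tags = []
        for kind, names in ENTITIES.items():
            for name in names:
                if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE):
                    tags.append(f"{kind}:{name.lower()}")
        return tags

    doc = {"text": "Qdrant added a new payload filter useful for entity extraction pipelines."}
    doc["tags"] = tag(doc["text"])  # store alongside the embedding as a metadata column
    # Pre-filtering: only run vector search over documents whose tags contain "org:qdrant".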
And after that, you can actually start to
think about, okay, what are the different
relationships between the different entities?
What entities should we add?
This will take a while, and I would
always start really small, make it really
specific to one use case, focus on the
key relationships and the key entities you
face there, add them, create a knowledge
graph, and then expand it as you go.
I'm not sure about the knowledge graph that
AskNews has, which is ever expanding
and has a potentially infinite size; I
don't think for most use cases this is
the right way to approach it, because
you're likely building
something really specific, and so you
should actually make the knowledge graph
specific to your use case as well.
At least that's my opinion.
And lastly, I think the advanced
querying, like using graph query
languages like Cypher or Gremlin or
Turtle, in the beginning it's
a waste of time to even touch that.
Construct your graph, store the
relationships in whatever database
you are using at the moment.
So use Qdrant, use Postgres, and when
you hit a point where you actually use
the knowledge graph so much that it
becomes a performance limitation, then
you can start to think about, okay,
I need something more custom. But I
would always opt for the simpler
solution, especially in the
beginning, when you don't really know:
will the knowledge graph work, and
will it have a benefit for my application?
You don't want to over-engineer it.
You want to test it with as low effort as
possible and with really low complexity.
So figure out a way to actually add
it to the databases you are
already using at the moment. You can
integrate a knowledge graph into
nearly every storage format or database.
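A sketch of keeping the graph inside the database you already have, using sqlite3 as a stand-in for a Postgres table (or a payload field on a Qdrant point):

    import sqlite3

    conn = sqlite3.connect("app.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS edges (
        subject TEXT, relation TEXT, object TEXT, source_doc TEXT)""")
    conn.execute("INSERT INTO edges VALUES (?, ?, ?, ?)",
                 ("Emergent Methods", "builds", "AskNews", "doc_123"))
    conn.commit()

    # A one-hop "graph query" without a graph database is just a WHERE clause.
    neighbors = conn.execute(
        "SELECT relation, object FROM edges WHERE subject = ?",
        ("Emergent Methods",)).fetchall()
    print(neighbors)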
So make your life a little bit simpler,
because in the end, the biggest waste
of time is if you spend weeks
figuring out knowledge graph
databases and graph query languages,
and then you don't even use them because
you realize all the advanced querying
is just a waste of time, or it's
too much for your specific use case.
Yes.
This was my podcast with Rob. So if you
listened until here, I can only ask
you: leave a review, leave a comment,
especially on Spotify and Apple, but
also on YouTube. It helps out a lot.
Also, if you have any topics you
would like to hear more about,
let me know in the comments below.
Otherwise, I will catch you
next week when we will continue
with more on knowledge graphs.