· 01:33:44
Nicolay Gerold: Time is probably one of
the most ignored dimensions, especially
within RAG and knowledge graphs.
But time shapes data
validity and also context.
Every data point exists in a
temporal context that affects its
reliability but also its usefulness.
A common pitfall: in cybersecurity,
for example, if we use week-old
threat intelligence, we can miss
new zero-day exploits.
In legal, if teams reference
an outdated contract or outdated
regulations and laws, we
risk really costly disputes.
Engineers who work with stale API
documentation introduce bugs into their
systems, and API contracts break.
So we really have to
timestamp all data operations:
ingestions, updates, observations.
At the least, this is usually
three different columns,
created_at, updated_at, deleted_at,
for the different rows within our tables,
and if we can, we should even try to
implement something like versioning where
we actually keep the old data as well.
So instead of updating the existing
record, we create a new one with the
updated data and mark the old one
as deleted, but keep it in there.
So we have this lineage.
And across all of this, we really
shouldn't treat data as perpetually valid.
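A minimal sketch of that timestamp-plus-versioning pattern, in illustrative Python; the class and field names are invented for the example, not an implementation from the episode:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Record:
    """One immutable version of a row: created_at/updated_at/deleted_at plus a version number."""
    key: str
    data: dict
    version: int = 1
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    updated_at: Optional[datetime] = None
    deleted_at: Optional[datetime] = None   # soft delete: the old version stays for lineage

class VersionedStore:
    def __init__(self):
        self._rows: dict[str, list[Record]] = {}

    def upsert(self, key: str, data: dict) -> Record:
        now = datetime.now(timezone.utc)
        history = self._rows.setdefault(key, [])
        if history:
            history[-1].deleted_at = now          # mark the previous version as superseded
        new = Record(key=key, data=data,
                     version=len(history) + 1,
                     created_at=now, updated_at=now)
        history.append(new)                       # keep every version: that's the lineage
        return new

    def current(self, key: str) -> Optional[Record]:
        history = self._rows.get(key, [])
        return next((r for r in reversed(history) if r.deleted_at is None), None)

    def lineage(self, key: str) -> list[Record]:
        return list(self._rows.get(key, []))
```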
So in today's conversation, we'll be
talking to Daniel Davis, who is working
a lot on what he calls temporal RAG:
basically, how we can bring a time
dimension to retrieval-augmented
generation and knowledge graphs.
And he puts data into
three different buckets.
And each of those requires
different handling.
The first bucket is basically
observations, which are like
measurable and verifiable recordings.
So, for example, I'm wearing
a hat which reads Sunday Running
Club, and Daniel notices, okay,
the hat has text on it which
reads Sunday Running Club.
This is an observation, but it
requires supporting evidence.
The second part, assertions, is
more of a subjective interpretation.
So, for example, he could say
like, the hat is greenish.
which might be hard to verify if
you're only listening to this.
So you would basically have to
trust his assertion; assertions
always require confidence levels and
usually also update protocols.
Facts are immutable and
verified information.
And the immutable part
makes facts very rare.
And facts especially need
a preservation protocol,
so that the fact can persist
through time and isn't updated.
Otherwise, it would fall
into the other two buckets.
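One way to make the three buckets concrete in code. This is an illustrative sketch only; the field names and structure are assumptions, not how Daniel or TrustGraph actually model them:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Observation:
    """Measurable recording; must point at supporting evidence."""
    statement: str
    evidence: list[str]                      # e.g. a transcript timestamp, a sensor reading
    observed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class Assertion:
    """Subjective interpretation; carries a confidence and can be revised."""
    statement: str
    source: str
    confidence: float                        # 0.0 - 1.0
    asserted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    superseded_by: "Assertion | None" = None # update protocol: link to the revision

@dataclass(frozen=True)
class Fact:
    """Immutable and verified; frozen so it cannot be updated, only preserved."""
    statement: str
    verified_against: tuple[str, ...]        # the observations/evidence that back it
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```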
And these differentiations, which
we will go deeper into, basically
allow you to distinguish
between static and dynamic data.
And you will realize that
most of your data is dynamic.
So you should treat it as such.
So you should add the time dimension,
monitor data freshness
systematically to see whether data
has become stale, and also review
and update your data regularly.
And when you get into the really advanced
space, what is really interesting
to me at the moment is, for example,
how to handle the dependencies between
two different parts of the system,
for example the code and the documentation,
which both have different lineages
through time as they are updated.
They also refer to each other, because
if the code base changes but my
documentation doesn't, I have a
break in the dependency, and I can
only notice that if I take time
into account as an extra consideration.
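A toy version of that dependency check: each artifact carries its own last-updated timestamp, the dependency is an explicit edge, and a mismatch in their lineages flags the break. All names and the 30-day threshold are made up for illustration:

```python
from datetime import datetime, timezone, timedelta

# last-updated timestamps per artifact (in practice these come from git, a CMS, etc.)
last_updated = {
    "payments-service/code": datetime(2025, 1, 20, tzinfo=timezone.utc),
    "payments-service/docs": datetime(2024, 6, 3, tzinfo=timezone.utc),
}

# explicit dependency edges: the docs depend on the code, so they should not lag far behind it
depends_on = [("payments-service/docs", "payments-service/code")]

def stale_dependencies(max_lag: timedelta = timedelta(days=30)):
    """Yield (dependent, dependency, lag) where the dependency changed but the dependent did not follow."""
    for dependent, dependency in depends_on:
        lag = last_updated[dependency] - last_updated[dependent]
        if lag > max_lag:
            yield dependent, dependency, lag

for dependent, dependency, lag in stale_dependencies():
    print(f"{dependent} is {lag.days} days behind {dependency} - likely outdated")
```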
And both of these principles, both
of these frameworks, basically
work together.
So good data isn't just about storage,
it's also about maintaining
integrity and usefulness over time.
And today we will be looking at
how we can actually embrace
time as a core data dimension,
how we can differentiate between
observations, assertions, and facts,
but also how Daniel and the team at
TrustGraph build their knowledge graph,
and how they basically favor simplicity
and modularity over really
elaborate taxonomies and ontologies.
Let's do it.
Daniel Davis: Time doesn't really
factor into a lot of the ways that
things are done right now in terms
of the architectures that we see in
AI frameworks, platforms, whatever
it is that people are calling them.
There's a lot of opportunity for time.
And also just in information itself,
time is one of those factors
that often gets overlooked.
I've seen that so much
in my career, though.
I spent a lot of time in the risk world
doing, you know, risk assessments for
really complex systems, and I've argued
for years that time is a really important
factor of a risk assessment, because
risk changes over time. Once you
do that assessment, you've kind of planted
a flag in the ground and said, this
is as of today. Six months from
now, ten months from now, two years from
now, this is probably not true anymore.
And there's not a single risk
framework that I'm aware of that
actually takes time into account.
There's a lot of work like that,
in fact, that just doesn't
take time into account.
And that's something that we at
TrustGraph are really looking
to do in the near future.
Nicolay Gerold: How do you
actually want to integrate it?
Maybe let's start first with the issue
that time really isn't considered
in knowledge graphs. Where does that
come from, and why isn't putting time
as an attribute or as a type of
relationship enough?
Daniel Davis: It's a pretty difficult
problem to solve. Think of a really
large document, a 200 page PDF,
because I have one I've been working
with a lot lately for an article I'm
actually writing, to do some comparisons.
It doesn't really have much time data in
it except maybe on the very first page.
You have a date in which the document
was issued, but once you get into
page 364, and the topics on that page,
there's no time information anymore.
So you don't get any information on how
fresh this knowledge, or this data,
or this information, this observation,
this assertion is. You don't get that data.
So the only way to really handle that
in a lot of cases right now is manually,
with document metadata. As you're doing
an ingest, you can preview metadata:
is there a date somewhere in the document?
And at the very least, I can say, we
ingested this document at this point.
And so you have that point in time.
So I think, when you look back at
historical information, there
aren't really good technological
methods and solutions to extract
information that might not be there.
But that's why you want to get this
information into a system going forward
that is able to manage that time.
Because once you get the data into the
system, you maybe have to plant that flag
in the ground and say, okay, we don't
have a lot of temporal data for what
came before, but now we do. So now we can
begin to say, it's fresh as of now,
and what happens going forward?
Are we seeing new information?
Are we seeing updates?
Are we seeing information that conflicts
with what we already have? And then
you're able to start to calculate:
how fresh is this data?
Is the data stale?
Is the data static?
Does it never change,
or does it change slightly?
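A small sketch of the kind of freshness calculation he describes, assuming only that you have recorded an update history per piece of information; the thresholds and field names are invented:

```python
from datetime import datetime, timezone, timedelta

def freshness_report(update_times: list[datetime],
                     stale_after: timedelta = timedelta(days=90)) -> dict:
    """Given the timestamps at which a piece of information was ingested or updated,
    report its age, how often it has changed, and whether it looks stale or static."""
    now = datetime.now(timezone.utc)
    ordered = sorted(update_times)
    age = now - ordered[-1]
    span = (ordered[-1] - ordered[0]) or timedelta(seconds=1)
    changes_per_year = (len(ordered) - 1) / (span.days / 365.25 or 1)
    return {
        "age_days": age.days,
        "updates_seen": len(ordered),
        "changes_per_year": round(changes_per_year, 2),
        "stale": age > stale_after,
        "static": len(ordered) == 1,   # only ever seen once: never updated since ingest
    }

# example: a claim ingested once and updated twice since
history = [datetime(2024, 3, 1, tzinfo=timezone.utc),
           datetime(2024, 9, 15, tzinfo=timezone.utc),
           datetime(2025, 1, 10, tzinfo=timezone.utc)]
print(freshness_report(history))
```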
But I think that's something we're just
going to have to accept: there's only
so much we can do with the past, and
we just have to really focus on what
we can do with this information going
forward. It's a big transition.
We've seen big technology transitions
like this in the past where
it's a paradigm shift, and part of
that paradigm shift is there's the
before time, and then there's the
after time. I used the word time
a lot, didn't I?
Nicolay Gerold: Hehehehe Yeah, you did.
Can you maybe first define the three
different terms you used, an observation,
a fact, and an assertion, and how you
would define those? And also, how do you
automatically differentiate between them?
Daniel Davis: Two of those, I think,
have pretty straightforward definitions.
One does not.
So, observations and assertions.
Pretty straightforward.
An observation is I think about it a bit
like in the quantum mechanical sense in
that you don't really know the state of
a system until you make that observation.
So an observation would be: you're
wearing a hat that says, I think
it says, Sunday Running Club.
Is that what it says?
Nicolay Gerold: Yeah.
Daniel Davis: Okay.
So that's that's an observation.
And an assertion would
be, I think it's green.
Which I think on my screen,
it looks kind of greenish.
So the distinction between those is
that an observation is a recording
of something that I can measure,
where an assertion is more of a
human opinion, a human statement,
my opinion about a particular
situation that doesn't have a
supporting observation.
So, that's the way I look at those two.
An observation has to be
supported by some other piece of
information, whereas an assertion
probably is not.
Now, fact.
Oh, boy.
This is something that even my co-founder
and I don't exactly agree on the
definition of, but we're working towards
one that we mutually agree on, because
it's something that can get so
complicated so quickly.
And I find it so concerning that
the majority of people today think
it's a really simple definition.
They think facts are these
black and white things.
And if you talk to anybody who's
worked with really complex information,
who's worked in the intelligence world,
for instance, they will tell you that
there's no such thing as a black and
white fact. Everything is in this
gray area. So I started giving this
a lot of thought: what is a fact?
We can start thinking about a fact
as an observation that is supported
with evidence or some sort of information.
Okay, that's a starting point.
But what happens when it changes?
And we see this all the time.
I think about the California
wildfires in LA.
There's a lot of information
coming out about that.
People are making a lot of observations
and they're not necessarily
trying to be malicious, but.
The information changes over
time as we get more information.
Some of the observations may not have
been completely accurate, or they
might not have been fully detailed.
There might have been omissions.
There might have been ambiguity.
There might have been misunderstandings.
So what do you consider a
fact in that situation?
I think the LA wildfires are a really
good example: what do we consider facts
in that very rapidly changing situation?
And that's where time comes in.
I started thinking about this and I went,
well, a fact is going to be something
that no matter how many times we talk
about it, it's never going to change.
It's static.
It's like geological bedrock.
You know, when you talk about
archaeologists or geologists and they
dig and they hit bedrock and they stop.
Why?
Because you've hit bedrock.
There's nowhere else to go.
You can't dig any farther.
And I think that's, I don't want
to say the only way, but that's a way
that we can look at a fact in a more
granular way that we can work with
going forward in time, again, we have
to use the word time a lot: it's
something that's never going to change.
Like, for instance, my observation
that you're wearing a hat
that says Sunday running club.
If I say during our podcast
episode, you're wearing a green
hat that says Sunday running club.
Well, that's never going to change.
So that's a fact.
Now, if I say your hat...
well, what can I say?
If I made some sort of assertion
about how I think you're feeling right
now, that's an assertion, and you may
tell me later on, well, I didn't
really feel that way.
So that's an assertion
that may change over time,
an observation that may change over time.
But the fact, well, there I go, see,
we have all these words; we struggle
to describe these situations without
using the word that we're trying to
define. So the observation that you're
wearing a green hat during our recording
that says Sunday Running Club,
that's never going to change.
So that's static.
So, in our thinking, we would
classify that as a fact.
Now, you can get much more
granular, much more detailed.
You start getting degrees of
fact, degrees of truth, which,
frankly, I'm a little bit
scared to go down that road.
I want to create a relatively simple
system, at least at first, one that
I think people can understand, because
I think people can understand
this piece of information.
It's never going to change.
It's etched in stone.
It always was true.
It always will be true.
And when you start thinking
about it that way, you realize
there aren't a lot of facts.
Most things can change.
And that's why the world of fact,
the world of information and
disinformation, is so difficult to
understand: because there is so little
information that is truly etched in stone.
Everything else is dynamic.
And when we start to accept that
information is dynamic and keep track
of it, keep track of how it's changed
over time, keep track of how often it
changed over time, is it changing
rapidly during this period and not
over here, that gives us a lot of
information to be able to determine
whether we trust this information.
And, you know, those are a lot of words
that have ambiguous, controversial,
and conflicting definitions.
And I think that with AI and AI systems,
how we're trying to use those, and
the availability of information now,
these are imperative problems to solve.
Nicolay Gerold: And then for the rest,
especially for the assertions: are
there other factors coming into
play, like consensus, the confidence,
or the authority of the source?
Do you handle that on top of it as
well, or do you want to handle it?
Daniel Davis: I'm going to give the
politician answer and waffle
and say, I don't know, maybe.
Well, okay, let me talk through this
for a second, because my mind is
racing, because we can go in so many
different directions with this.
But there's value actually on both sides.
From a system designer, from a
trustworthiness perspective, I think
the analytics within the system are
kind of where we're going to ultimately
land as being the source of truth.
And here we are.
We're using words again
that we're trying to define.
However, knowing the confidence of
your assertion could be particularly
useful in the system, because then all
of a sudden we can say, you made an
assertion about what was happening in
the Palisades during the wildfires,
and you said it with high confidence,
and now we've got information
that conflicts with that.
So we might say, well, wait a minute,
maybe this person or this news source,
that affects their trustworthiness:
they said they were really confident
about this information, and now we've
got conflicting information. And you
can start to establish trends:
how often does this happen?
Does this news source continually give
us assertions with high confidence,
and that information turns out to
have conflicts and changes?
So that is useful information for
the system to have, but I wouldn't...
okay,
that's a hard question.
These can go in lots of different ways.
See, you're smiling too.
So I'm going to make an assertion.
You're enjoying this.
You're enjoying seeing me squirm.
I think,
you remember how I talked about there's
a before time and there's an after time.
I think the answer to this is we
have to take a Bayesian approach.
That first observation, you have
to take at face value.
If a source, and again, if we're using
the news, for instance, if the news
makes an assertion with high confidence,
we have to take them at face value:
okay, this is a highly
confident assertion.
And then as the information gets into
the system and we get more information,
we get updates to that information,
we can begin to go back and revisit:
was that correct? Was it fair to make
that assertion with high confidence?
But at the same time, we actually
need more information on
that particular topic to even
be able to come back to that.
So
Nicolay Gerold: Yeah,
Daniel Davis: With data, the very first
step is not the same as all the steps
that come after it, because there is
that naive aspect: you're trying to
understand information, you're trying
to understand a document, and you're
getting it into the system, so I don't
know a lot about it. But once it's in
the system, I do know a lot about it.
And, of course, Bayesian methods are
kind of the mathematical way we
deal with these problems.
And that's something that we've also
talked about for a long time, being
able to use more Bayesian models
to try to understand these kinds of topics.
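One simple way to realize that Bayesian idea is a Beta-Bernoulli model of a source's reliability: start from a prior that takes the first confident assertion at face value, then update as later information confirms or conflicts with it. This is a hedged sketch, not TrustGraph's method; the prior values are arbitrary:

```python
from dataclasses import dataclass

@dataclass
class SourceTrust:
    """Beta-Bernoulli model of a source: a mildly optimistic prior (take the first
    high-confidence assertion at face value), updated as later information confirms
    or conflicts with what the source claimed."""
    alpha: float = 2.0   # pseudo-count of assertions that held up
    beta: float = 1.0    # pseudo-count of assertions that were contradicted

    def update(self, held_up: bool) -> None:
        if held_up:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def expected_reliability(self) -> float:
        # posterior mean of the Beta distribution
        return self.alpha / (self.alpha + self.beta)

news_source = SourceTrust()
print(round(news_source.expected_reliability, 2))   # prior: ~0.67

# later evidence conflicts with two of its confident assertions, confirms one
for outcome in (False, False, True):
    news_source.update(outcome)
print(round(news_source.expected_reliability, 2))   # revised downward to 0.5
```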
Nicolay Gerold: Maybe to nail that
down a little bit: this isn't just a news
thing. In a prior episode, I talked to
Max about the documentation at Google,
and there are often outdated components,
like where one part of the system was
updated and its documentation with it,
but for a dependent system, the
documentation wasn't updated.
So you really have this temporal aspect:
one thing changed and the other thing
didn't. With a temporal component,
and a component that actually maps that
dependency, you could make it obvious
that it doesn't fit anymore, that
something is outdated. But in the current
scheme of things, where you basically
embed once and use it forever, this
just doesn't work. And the same goes for
what we recently worked on, where they
basically have documentation
on all of the different parts and all
of the different systems, and often
different components are swapped
out, so the machine doesn't really
fit with the documentation anymore.
So this is also something
you have to consider.
And the human part is mostly
the biggest point of failure.
Because when the documentation
isn't updated, you can't really
update it in your system.
Daniel Davis: Oh, that resonates
on so many different levels, including
our own TrustGraph documentation.
I think I told you that we have
lots of good intentions about
keeping it updated, but it's
pretty much exactly what you said.
You have these dependencies that
you've maybe forgotten about, or
it's not immediately obvious, and
you don't realize that's there.
Knowledge graphs are a great solution
for that problem, in that you see
those relationships. So once you start
to track things temporally, you can
look at: when was this last updated?
And, oh, these two topics are
related, and this one hasn't been
updated for a very, very long time.
Maybe it should be updated.
But so many things come to mind when
you talk about how information evolves
over time and understanding that.
And I don't really have a good answer
for why we don't do this more now,
because in the world I came from in a
past life, the aerospace world, time was
really, really critical, and there was
a lot of effort that went into keeping
track of this kind of information.
Now, it wasn't done with
a technology solution.
It was mostly done by people,
recording it sometimes in spreadsheets.
I wish that wasn't the answer, but it
really, really was, because there was a
recognition that these issues are really,
really important: understanding that
this system failed, and it's related
to all these other systems, and we
haven't seen anything from this
system in a long time.
Maybe we should, you know, check it out.
Maybe there's something
going wrong with it.
Or, somebody mentioned to us recently
an interesting use case for what
we're doing: aircraft maintenance.
I know a lot about that one, and
that's actually a really big one.
You treat an aircraft almost like a person
with their health record: you look
at its maintenance record, and time is
a huge aspect of it, because you have,
well, here are the things that you do first.
You want to see that checklist,
and it can be really important:
what's the order of things?
What's the order you did this in?
Did you check the thing that
we told you to do first, first?
Because if you didn't, we probably
need to go back and check that.
Or knowing the relationships:
depending on what system you checked,
maybe we don't need to do that,
because there's a sequencing order here.
You know, the sequence of events,
the freshness of information,
the staleness, because that's
another aspect of it.
You know, I remember there
was one particular aircraft.
I still remember this aircraft really,
really well, because I had to travel
to go see it personally,
because it was sick.
You know, we had, you know, months
and months and months of records,
and at some point, it's useful to
know what they did eight months ago.
But that was eight months ago.
How many times have they flown
this airplane since then?
It's not that it isn't useful,
but you have to prioritize the
most current information more
than the past information.
But you also have to correlate it.
Is there information in the
old information that's not
in the current information?
And again, that's a really important
distinction. You need that time aspect
when you're ingesting all this information
to be able to make those distinctions.
And as you said, with something like
vector embeddings and semantic similarity,
this in situ context is totally lost.
And even with knowledge graph
relationships, this is not something
that's going to be naturally
in a knowledge graph either.
It has to be manually added, and
there has to be a schema or an
approach for how you're going
to do that: whether you're going to
pin the metadata to nodes,
properties, or relationships.
And again, if you look at
all of the schemas out there,
the ontologies and taxonomies,
the big ones like schema.org
or OWL or DBpedia,
once again, the focus is
on entity resolution.
You don't really see
a lot about time.
And if you actually look at the big
knowledge graph database systems,
they don't give you a lot of tools
for dealing with time either.
That's something that you
kind of have to add manually.
Nicolay Gerold: And how do
you see integrating the time
aspect into the knowledge graph?
Do you see it still as metadata that's
attached to nodes and relationships?
Or do you see it integrated
in a different way?
Daniel Davis: I think the
bulk of it is metadata.
The difference would be, say you have
the document that I've been working
on for this article. It's an old
aerospace airworthiness document
called MIL-HDBK-516C.
If you're looking for something to
put you to sleep at night, to cure your
insomnia, it won't take many pages of it.
And it was my life for one of my jobs
at one time, really understanding
the full breadth of it.
And, of course, it's going to have a date
at the very beginning: when it was
published, when it went into effect.
So I think that would be a
relationship in the graph: this
document called MIL-HDBK-516
was published on, and you would have
a literal that would be the date.
So that wouldn't be metadata;
that would be an actual
relationship in the graph.
But when we ingested and processed the
document, that would be metadata.
And then, when new data or new updates
came in that affected topics, nodes,
or relationships in the graph,
that would be new metadata.
So I think the bulk of it is metadata,
with some exceptions. For instance, the
date that you publish this podcast:
that would be a relationship,
podcast was published on, and then
you would have that date.
Or news: this news was published
on X date. And you would actually keep
that going forward; even if there are new
updates, you would probably maintain
that, because that would be useful.
But all the processing you're
doing in the system,
that would be metadata.
And metadata is really where
the intelligence process lives.
In processing raw data, it's all that
metadata you append to it, and
understanding the relationships between
the metadata and all the insights
and analytics and all the connections
that come from that data set.
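As an illustration of that split, a published-on date can live in the graph itself as a triple with a date literal, while ingestion and update history sit in metadata records on the side. The predicate and pipeline names below are assumptions for the example, not TrustGraph's actual schema:

```python
from datetime import date, datetime, timezone

# A "published on" date lives in the graph itself, as a triple with a literal object.
# (Predicate names and the date value here are illustrative.)
triples = [
    ("doc:mil-hdbk-516c", "rdf:type", "ex:TechnicalDocument"),
    ("doc:mil-hdbk-516c", "ex:publishedOn", date(2014, 12, 12)),
]

# Processing history, by contrast, is metadata *about* the graph data,
# recorded alongside it rather than as domain relationships.
processing_metadata = [
    {
        "subject": "doc:mil-hdbk-516c",
        "event": "ingested",
        "at": datetime(2025, 1, 5, 14, 30, tzinfo=timezone.utc),
        "by": "ingest-pipeline",   # illustrative component name
    },
    {
        "subject": "doc:mil-hdbk-516c",
        "event": "updated",
        "at": datetime(2025, 2, 1, 9, 0, tzinfo=timezone.utc),
        "reason": "new revision affected related nodes",
    },
]
```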
Nicolay Gerold: I'm really torn how to
differentiate between these two, like,
when is it actually a relationship?
And when is it an attribute?
How do you put your finger down and say:
okay, in this case it's a relationship;
in this case it's just an
attribute, which I put on the node?
Daniel Davis: I think the simple
answer is we're just going to have to
make a definition and see how it works
out. I think this is going to be a bit
of an empirical process: this is our
approach, we think this is the
best way to approach it.
And again, Bayesian: we make our
assertion. I mean, that's
the Bayesian process.
You make your Bayesian prior,
this is the system,
and you go forward and you test:
was this a good assertion?
Was this a good definition?
And I'm fairly confident
it will change over time.
And, philosophically, that's why this
entire area is so complex: you just
quickly pointed out a simple area of
the system that we can't really define
in a black and white way that
everybody's going to agree on.
Already you figured that out.
And, you know, that's part of
my core thesis is that none of
this stuff is black and white.
Facts are not black and white
until you get something that
just those rare statements, those
rare pieces of information that
always were and always will be.
And those are pretty rare.
Everything else is somewhere in between.
And when you have things in between
and you get 100 people, you're going
to get a lot of different approaches.
And then you have varying
degrees of value out of those.
Nobody's going to be... why do I
say nobody? See, again, those
are absolutes.
People are going to be varying
degrees of right or wrong.
Again, binary choices.
There's going to be degrees of value
that come from these approaches.
And again, from a business perspective,
from a system design perspective,
the goal is to drive the most amount
of value at the least amount of cost
to the user. So that's going to be
where the trade-off is.
And I think that's why we kind of have
to simplify a lot of this, because
it can get so complicated so fast.
I had somebody ask me the other day
whether I worry about duplicates
in our knowledge graphs.
And I went, there are a lot of people
who get really, really wrapped
around the axle on duplication.
But does it really impact the outputs?
Oftentimes it doesn't.
I mean, if you look at any data store,
especially for any large company or
customer data set, it's going to be
mountains and mountains and mountains
of duplication, and you can work with it.
So how much time do you want to spend
on deduplication when your query tools
and your ability to access that data
allow you to get the value
that you need out of it?
Those are part of the system design
choices: yeah, this isn't perfect,
yeah, there are duplicates,
but does it affect the outputs
we're getting from the system?
And how much time are we going to have
to spend to do this deduplication?
I think that's a very similar topic
to what you're talking about here.
You can very, very easily go down this
rabbit hole of how to prevent duplication
and spend huge amounts of time
and effort trying to do it.
And then you go, what did we get out of it?
Nicolay Gerold: Yeah, and I think in
the end it will depend on what you put
as nodes, what the relationships are,
what entities you are using, how you
are integrating your temporal data.
It's: what queries do you actually
need to support? What are, in the end,
the query patterns of the different
users? And this will inform, in the
end, what relationships, what entities,
and what metadata I need.
Daniel Davis: You know, querying is
an interesting topic in and of itself,
too, because, just take knowledge graphs,
for instance: you can do really
sophisticated query techniques
with knowledge graphs.
You can start querying on: I want to
know the shortest path between two nodes.
I want to know the longest
path between two nodes.
I want to know this kind of path.
I want to be able to do this many hops.
There are some really sophisticated
graph query techniques out there, and
I don't think many people actually
use them. And I say that, that's an
assertion with relatively high confidence
for me, because I know so many
people whose life is selling graph
databases, and you talk to them:
what are your users saying?
And we find very few people use those
really sophisticated techniques.
It's more of: I have a topic, and I want
to get the first hop of subgraph
that's related to that topic.
Now, what's interesting is, not only do
people not really use those sophisticated
techniques much, and I'm sure I'm
already envisioning lots of comments
from people telling me I'm wrong on
this, but once again we're about to
revisit the theme of the before times
and the after times.
I see use cases for those
really sophisticated techniques
with complex ontologies.
If you have this really complex ontology
and you know, the data is in this
structure, then there are use cases where
you might want to say, is this piece
of information, is it connected in any
way to this other piece of information?
And you might want to use some really
sophisticated techniques for that.
However, with AI, with the language models
that we're now using, we extract and build
our knowledge graphs to be very, very
simple, with a very, very flat structure.
And the reason why, and I guess this is
also something that I should mention, is
that we say graph RAG, but we don't
do pure graph RAG. If the graph RAG
definition is that you're using an LLM
to generate Cypher or GQL or SPARQL
queries, that's not what we do.
We actually create mapped vector
embeddings, and then we use a semantic
similarity search on the query or the
request to say, hey, what are the
potential nodes of interest here,
rank those, and then generate subgraphs
off of those with pretty standard queries;
there's something like an eight-query
permutation that gets all the
relationships to build a subgraph.
So that's the way that we approach it.
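Roughly that retrieval flow, sketched end to end: embed the node labels, rank them against the query by cosine similarity, then expand the top candidates into a subgraph with plain triple lookups. The embedding function here is only a placeholder and all names are invented; the real TrustGraph pipeline will differ in detail:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for whatever embedding model you actually use; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

# node label -> embedding ("mapped vector embeddings")
node_embeddings = {label: embed(label) for label in
                   ["airworthiness", "flight control", "Sunday Running Club"]}

triples = [
    ("airworthiness", "definedBy", "MIL-HDBK-516C"),
    ("flight control", "assessedFor", "airworthiness"),
]

def retrieve_subgraph(query: str, top_k: int = 2, hops: int = 1):
    q = embed(query)
    # rank candidate nodes by cosine similarity to the query
    ranked = sorted(node_embeddings,
                    key=lambda n: float(q @ node_embeddings[n]),
                    reverse=True)[:top_k]
    # expand each candidate into a small subgraph with standard triple lookups
    frontier, subgraph = set(ranked), []
    for _ in range(hops):
        nxt = set()
        for s, p, o in triples:
            if s in frontier or o in frontier:
                subgraph.append((s, p, o))
                nxt.update((s, o))
        frontier = nxt
    return ranked, subgraph

print(retrieve_subgraph("how is airworthiness assessed?"))
```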
And if you approach it that way, all of
a sudden, do you really need these
complex ontologies? If you look at
schema.org and all the depth they go to
to try to classify websites for SEO,
do you need all these product
categories and all these
categories anymore?
Because we now have a tool, semantic
similarity search, to know this is
the topic that we're looking for.
And if we can then correlate that
topic to all the other related
information that's in the graph,
do we need these complicated ontologies
and taxonomies or schemas anymore?
And I would argue no.
You can still ingest them; as a matter
of fact, one of our users is actually
working with STIX, which is a very
complex ontology in the cybersecurity
world for threat intelligence.
But I think what we're seeing is that
we now have tools that don't require
these really complex ontologies and
schemas, and that let us work with data
more at the relationship level.
And because of that, I think a lot
of these really, really complex query
algorithms are going to see even less use.
And I could be wrong; somebody could
find some really, really niche use case
for some of those that nobody's
ever thought of.
But I don't see it at the moment.
Which also brings me to something
that I also kind of think about,
and that's: do you really need a
knowledge graph database?
That's a really, really interesting
question, because you can store a
knowledge graph in columnar tables.
There are ways to do that.
It's not that hard.
As a matter of fact, that's what we
do with our Cassandra deployment.
Cassandra is not a knowledge graph
database, so we're actually writing
knowledge graph triples into
sets of tables in Cassandra.
We also support Memgraph and FalkorDB
and Neo4j and some others that we're
working on integrating, which are more
traditional knowledge graph databases,
but you don't even have to have a
knowledge graph database system to
store these kinds of relationships
and then extract them.
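A sketch of what triples in plain columnar tables can look like, using the DataStax Cassandra Python driver; the keyspace and table layout are one reasonable guess, not TrustGraph's actual schema:

```python
# Storing graph triples in plain Cassandra tables - just one possible layout.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS kg
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
# Partition by subject so "give me everything about node X" is a single-partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS kg.triples_by_subject (
        subject   text,
        predicate text,
        object    text,
        PRIMARY KEY (subject, predicate, object)
    )
""")

insert = session.prepare(
    "INSERT INTO kg.triples_by_subject (subject, predicate, object) VALUES (?, ?, ?)")
session.execute(insert, ("doc:mil-hdbk-516c", "ex:publishedOn", "2014-12-12"))
session.execute(insert, ("flight control", "ex:assessedFor", "airworthiness"))

# One hop out from a node is one partition scan - no graph database required.
for row in session.execute(
        "SELECT predicate, object FROM kg.triples_by_subject WHERE subject = %s",
        ("doc:mil-hdbk-516c",)):
    print(row.predicate, row.object)

cluster.shutdown()
```

In practice you would likely add a second table partitioned by object so you can traverse edges in both directions, but the idea is the same.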
And again, it kind of goes back to:
maximize value from the system
while minimizing the cost that it
takes to design it and operate it.
And yeah, it's fun to think about
complex ontologies, but is
it really necessary?
And that's something that my co-founder
Mark actually talks about.
I've asked him this question before:
why do people even develop these incredibly
complex ontologies and these schemas
when there are ways that you can
query the system without them?
And I think it's a little bit
that people just kind of like it.
It's something that people
found interesting. You give people
a tool and they go, oh, I can do
all these other things with it.
And nobody ever stopped them
and said, well, do you need
to do those things?
Do we actually need that tool?
Or is the hammer fine?
And I think in this case, we're finding
that pretty simple queries suit most
people's needs, and we can now do
them in a very sophisticated way.
Nicolay Gerold: I'm a little bit torn,
because I think with LLMs we get into
a space where it's become way easier to
actually construct knowledge graphs.
But when we use LLMs, it's typically very
free, so any entities, any relationships
are extracted and inserted into the graph.
And because they have been trained on the
different schemas as well, I think LLMs
tend to extract entities and
relationships that are similar to
schema.org and the other big
taxonomies and ontologies.
Versus, I think, more of a simple
homegrown solution, where you focus
on a few core entities and a few core
relationships which actually matter for
your system, and you restrict it to those.
I would actually like to know which of
the two sides you are leaning towards:
a really open knowledge graph, which is
ever expanding, or more of a
constrained one, which is really tailor
made for the system? We are now going
into the black and white thing again.
Daniel Davis: Well, to address the
first point you made about LLMs, one
of the things that we've found
consistently is that different LLMs
approach this problem very differently.
You're very right in that regard.
For instance, the Llama family of models
seems to focus more on people: even if
you ask it for entities, you're
probably going to get people, and it's
kind of hard to get it off of that.
Actually, there are a lot of models where,
if you ask for entities and conceptual
topics, the lists will be exactly the same.
It's actually a pretty small
number of models that we've found
that can make that distinction.
So I completely agree in that regard:
there's no perfect extraction
with an LLM right now.
There's no question about that.
And it can also be a little bit specific
on the knowledge, because you're kind
of relying on: is this information in
the training set for this model?
And here's a good for instance on that.
Most LLMs that we've dealt with,
well, they do pretty well on Cypher
for sure, because there's a lot
of stuff about Cypher out there
on querying knowledge graphs.
So they do pretty well on Cypher.
RDF, on the other hand, and you
mentioned schema.org, which is
actually based on RDF:
RDF, not so much.
The most common format that people
work with for RDF is Turtle,
which is the most human readable.
And we have not found an LLM that can
really consistently output Turtle.
And the reason why is, if you just
do a Google search on RDF Turtle,
you'll find blog posts where the
explanations are just flat out wrong.
So then you're thinking: well, think
about the training process.
Garbage in, garbage out.
Garbage went into the system,
so how do you fix that?
Well, in all these processes on the
back end where humans are manually
reviewing the data, whatever buzzword
people want to throw at those,
somebody would actually have to be able
to look at that and know enough about
RDF Turtle to realize that this is wrong.
And I'm going to wager a decent amount
of money that there is nobody on the post-
training side of companies like OpenAI and
Anthropic and Google who knows enough about
RDF Turtle to realize this isn't right.
So right there, we can see these
models are not going to be able to
do this consistently, and they're not.
And to answer your question,
our approach is that we're trying to
take as broad an approach as possible.
We don't want to create
these very, very narrow,
complex ontologies and schemas.
We've tried to develop TrustGraph from day
one to be as flexible, as modular, as
open as possible, so people can take
these components and very
quickly tailor them to their needs.
And one of the things that we focus on is
what we call cognitive cores,
which are basically modular components
where we've done this extraction.
For instance, MIL-HDBK-516C, which
I've been playing around with:
if you go to our GitHub, we have
a cognitive core that you can download
and reload into the system in minutes,
where we've already done the extraction,
so you don't need to do that extraction.
Part of our vision is that
you would be able to very quickly
share a cognitive core you've made
with me, and I could share one
that I've made with you.
There can be a store where
we can download these.
You could upload ones that you've made.
You know, you could take your entire
podcast series, and you could create
cores from the transcripts and upload
those and people could use them.
And again, though, it's not perfect.
It's not black and white.
Even the cores that we've made for this
particular document that I keep talking
about, there are some relationships
that aren't necessarily connected.
If you're able to really dig
into how it's stored in the graph, the
information is there, but sometimes
there are some missing links.
And that's why our future roadmap
includes working on tools that would
give people the ability to
make edits going forward.
So again, temporally, not only would
there be the automated side of that,
where you're ingesting new information
and, if the information supports the old
information or changes it somewhat or
conflicts with it, making those
observations and recording all that
and doing all those calculations,
but also where the human can say:
no, I manually added this, or I did
this because this is my take on it.
And again, underlying all of that, there
would be metadata associated with it:
is this an automated process?
Was this a human process?
Who made this edit?
Who made this change?
Recording all that information.
And as you go over time, that enables you
to, I hate to use the word calculate, but
to come to a better understanding
of what sources, what knowledge
relationships you can trust in the system.
And again, we could have spent days
talking about how much of our lives
my co-founder and I have spent
trying to create metrics to calculate trust.
And it's not just us; many people have.
And I haven't seen one that has
really caught on, because they
all have lots of pros and cons,
and the cons often hurt the pros so much
that it's difficult to use them.
Which, again, gets us back to:
what's the most amount of value
we can get from these systems at
the least amount of cost?
And I kind of think about it from the
Moby Dick perspective, the white
whale. Some of these are white whales.
People spend their entire
careers chasing after these topics:
what's trust as a number, what's risk
as a number. They dedicate decades
to it and they do good work, but
they never quite get there. And I think
that's really important to realize:
how many of these topics are white whales?
And I think there are a lot of
white whales in AI right now,
topics people are chasing where
you're never going to get there.
We're much better off accepting that
now and saying, we're never
going to get to this point of perfection.
So what are approaches that are usable,
or how do we begin to build on top of
these processes and approaches to get
the necessary amount of value out of
it, commensurate with what
we're putting into it?
Nicolay Gerold: I think that's
actually also one of my biggest pet
peeves in the space: tools just keep
adding on features until they
are basically useless.
I think we are lacking stuff in this
space, open source projects but also
software products, where they just say:
this is the feature set, we do one thing
really well, we now have a feature set,
now it's only maintenance and
we don't go beyond that.
And I think there are a lot of tools that
have found it, they are really great
solutions, and now they just start adding
useless stuff for additional use cases.
And I think we need more tools that just
say: okay, this is what we are doing,
we are really great at that,
and we are not going beyond that.
Daniel Davis: Well, that's kind of why
we haven't worked on lots and lots
of connectors for TrustGraph.
If you look at some of the other
frameworks out there, agent frameworks,
different frameworks, they have lots of
connectors for all the back-end systems
that your enterprise might be using.
They have a page that you can scroll
and scroll and scroll, and that could
be 50 different connectors, even more.
And that's one of those situations
where, once you go down that road,
it's exactly what you said:
you will spend so much time maintaining
all those connectors, because every time
those dependencies update their
system, or update their API, or
make a change or add a new feature,
that's where all your time goes,
maintaining those connections.
And so, to your point: are you really
focusing on the thing that you do well?
That's where we've been trying to focus,
on the things that we do well,
and not on all those other connectors.
But I think there's also sometimes
a little bit of unrealistic
expectations from customers that
the product should do everything,
and they don't realize that for it
to do everything, it's probably not
going to do anything particularly well.
So it's one of those things.
I think it's interesting: when I was
growing up, people had Swiss Army knives.
Do you remember those?
Because I don't ever see people
talk about those anymore.
It was a big deal to
have a Swiss Army knife.
Nicolay Gerold: have them.
Daniel Davis: But do you ever use it?
Nicolay Gerold: Yeah, because
I often lose stuff like,
Daniel Davis: Okay, okay,
Nicolay Gerold: bottle
openers and stuff like that.
Yeah.
Daniel Davis: From that perspective,
oh, wow, it's a dream, because,
look, it does everything.
It does this, it does that.
You can make amazing content about
all the things it does.
But if you try to use it, you
realize that, yeah, you could
use it to open a bottle of wine,
but if you had an actual wine
opener, you're going to choose
that every single time,
because it's the better tool.
It's better because it's designed
specifically for that.
So on one side you get these tools
that do it all, but do they really
do it well? And wouldn't you be better
off just taking the one that does
the one thing really, really well?
Nicolay Gerold: Yeah, and I think that
really leads into another topic we
talked about before, API meshes.
You said that a lot of AI is
actually just connecting different
APIs and calling them.
How does that fit together with the
view that tools should be specialized?
Because when I use specialized tools,
I tend to have more different modules,
APIs, whatever, that I have to call.
Daniel Davis: The thing is that so
many of the frameworks, or even platforms,
that we see out there, when you really
dive into the code, that's all they
are: glue code. Because what is glue
code? Glue code is connecting
two different services together.
And when everything is an API service
and you just begin to connect them
together, yes, you've kind of built this
really complicated mesh network, in a way.
But then how does that network
operate on its own?
You've created this mesh network
with all these API connectors, or even
more linear systems, more point to point.
So what's controlling all that
network flow underneath it?
And if you're not worried about
performance, if you're not worried
about having lots and lots of users
and lots of data, maybe you don't care.
And I think that's one of the things
I've seen in AI: so much of it,
what you see is a demo, where so much of
it came out of a hackathon, something
that people could build over a weekend,
something that came out of a Colab
notebook that people could really
quickly get their hands on and,
within an hour, say, oh, this does X,
and it's really cool.
What does it take to take it beyond X?
Can it go beyond X?
How do you take this to having
millions of concurrent users and
petabytes of data underneath it?
And then what is the user experience?
When you create these mesh networks
with no control underneath them, you
create these bottlenecks where
somebody's going to be waiting.
I think about a state machine analogy
with an elevator, in which it has to
transition from: I'm on the 4th floor
and I'm going to the 12th floor.
If somebody else is trying to
go down, it's not going to stop.
It's going to take me to the 12th floor
and then come back, because those are
just the rules of how the states
transition. And that's something
that we just accept for an elevator;
that's just the way it works.
But customers of software products
don't tend to have that patience when
they're waiting because, oh, somebody
else's requests are going through and
you just have to wait, because that's
the way the system has been designed.
And to get to your question, I think
there needs to be more of an
understanding of what software
infrastructure really is
and why it's important.
And there have been a couple of things
that I've found shocking over the last
year. I used to go to a lot of
AI events here in SF;
I haven't had as much
time to do that lately.
And when you say words like
infrastructure, so many developers now
look at you like, what do you mean?
Do you mean like AWS?
What do you mean, infrastructure?
What does that even mean?
TrustGraph uses Apache Pulsar as our
backbone, and I'll say things like,
we're using a pub/sub backbone.
And this is no joke, I'm not kidding
on this, because I actually count:
in the last year, I've met fewer than
five people who knew what pub/sub was.
And these are Silicon Valley
engineers who are building AI.
That is no joke.
That is no exaggeration.
Because I count.
And some of the people that I've met
who actually knew what it was were
actually visiting San Francisco
from somewhere else.
I'm just flabbergasted.
I'm dumbfounded.
And I think that's because, and I'm
trying not to derail us here, because
I'm about to spin up on a topic I
could just keep zooming in on,
over the last decade or so
we've decomposed software down
to the smallest components possible,
to get it into a JIRA task.
And that's so many software engineers'
lives now: they work on a very
particular set of features, maybe
one feature in this one service in
this one overall stack, and they
don't really touch anything else.
And so it's not that they don't
understand these topics, it's not that
they're not technical enough; it's just
that they don't ever see any of this
stuff. They don't ever see it.
They don't ever think about it.
And if you spend too long in these
worlds, in these very, very granular
topics, you kind of lose the ability
to see the big picture, because
you just haven't flexed that muscle.
You lose that ability to think,
oh, wait a minute: how does this
affect the site reliability team?
What do those people even do?
What's the stack underneath this?
How does my service
relate to other services?
What are they actually running on?
How do we deploy these services?
How do we have backups for these services?
How do we have failover?
How do we do any of these things?
And there are so, so few people
who think about those problems.
Oftentimes companies don't invest
in it because it's not a feature;
it's not something that they can
directly market and sell to the customer.
So you only see it in people who have
worked in environments that have
high reliability requirements:
high reliability, high availability,
you can't lose data, data loss is
really, really critical.
So unless you happen to work on
software that touches those systems,
you're going to lose it.
You've just kind of forgotten about these
issues, or haven't even thought about them.
And I think we've got to start thinking
about these problems at a more holistic
level, a bigger picture, the systems
level of how these systems integrate
with each other. Because if you build a
really, really reliable, flexible,
modular system where you can
plug tools in, then you can build the
best tool possible and it plugs right in.
And the thing is, this
stuff already exists.
It's out there.
That's why we chose Apache Pulsar.
Apache Pulsar is fantastic.
It's got all of the nines in reliability.
It's got all the
modularity and flexibility.
The way we use it in TrustGraph,
you can very easily take a software
module and hook it into Pulsar, and all
you've got to do is define your queues.
If you're trying to subscribe
to a queue that's already there,
you just subscribe to that queue.
If you want to create a new one, you just
create a queue and start publishing to it.
It's crazy how simple it is to
scale up on these services so that you
can focus on building this one tool
that's great at what it does, but you
have to have that foundation there.
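A minimal example of that publish/subscribe pattern with the Pulsar Python client; the topic and subscription names are made up, and TrustGraph's own queue naming will differ:

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# A module that produces work just creates a topic and starts publishing to it.
producer = client.create_producer("persistent://public/default/extracted-triples")
producer.send(b'{"s": "flight control", "p": "assessedFor", "o": "airworthiness"}')

# A module that consumes work just subscribes to the topic that is already there.
consumer = client.subscribe(
    "persistent://public/default/extracted-triples",
    subscription_name="graph-writer",         # illustrative name
    consumer_type=pulsar.ConsumerType.Shared,  # multiple workers can share the load
)
msg = consumer.receive(timeout_millis=5000)
print(msg.data())
consumer.acknowledge(msg)

client.close()
```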
And that's what I see totally lacking
in so many of these frameworks that just
focus on the higher levels of the stack.
If you're thinking about a building,
a high rise, they focused on building
out all the floors first, all the
office space and all the bathrooms
and all the kitchens.
And they're great, except:
what's underneath it?
There's nothing there.
Well, what happens when we
try to scale it up?
It collapses. And that's kind of
one of those big picture topics,
and honestly that's why I've been
doing more stuff like this, why we
started doing our own podcast series
where we interview people, and why
we've been more vocal: to try to shift
the narrative in this space and say,
time out. Yeah, the stuff you guys
are doing is really, really cool.
But you're going to hit a brick wall.
Actually, you're not
going to hit a brick wall.
You're going to hit an impenetrable
wall that you can't tunnel through.
You can't tunnel underneath
and you can't climb it.
You're just going to be stuck.
And the only way to get around that
wall is going to be to turn around
and go in a different direction.
Nicolay Gerold: Yeah, I think it's
partly also that there aren't many people
who actually get to see the problem
and come up with the requirements for
a solution, what do I actually optimize
for. The large majority of developers
actually sit at the end of the process.
They get told what they have to achieve,
or rather the solution they have to
program, and aren't really part of the
decision making, so they really
can't learn it.
And one project I think is really
interesting to study in that domain
is, for example, TigerBeetle, which
is an open source database tailor
made for financial transactions.
And I think it's such an interesting
project because you can really see,
when you go through the docs and their
blog posts, how they're optimizing for
that specific problem, and they really
go from problem to architecture to
the specific implementation, which is
really interesting to read through.
Daniel Davis: No, I completely agree,
and this is what happens when you design
companies where you want to decompose
people, essentially, into replaceable
cogs in the machine, and that's kind of
what we've ended up with: to big tech
companies, software engineers aren't
people; they're five years' experience
with Python and this particular
stack for this particular use case.
And we've decomposed it such that
if this employee leaves, we can just
slide somebody straight in and there's
no change; we don't see any change in
our productivity. But do that long enough
and you get people with 10 years' experience,
15 years' experience who, as you say,
haven't touched other parts of
the system. And it's not their fault.
It's not because they're not interested
or not capable; it's just that, in a
lot of cases, they're not even allowed,
or they don't even know that stuff exists.
I've seen that time and time again
in tech companies, where there are teams
doing work that you didn't even know about.
I had times in my aerospace career where I
found out that there was a row of people
two cubicle rows over doing the same thing
I was, and we didn't know about each other.
Or there's a team across the hall doing
the same thing for a completely different
set of people, and you just found
out about it by accident.
And you go, shouldn't
we be working together?
Yeah, you probably should.
But we're not going to. And you go,
isn't that going to cause problems?
Aren't we going to step on
each other's toes?
Something we do might break
something they're doing.
Yeah, and that happens a lot when people
are doing all this work in isolation.
But, I mean, all of a sudden we're
now into a rant against company
capitalism. When you prioritize
decomposing these tasks to that level,
that's part of the problem that
you're going to see, and it's
also short-term thinking.
Now, I've seen that change a lot through my career. I'm a bit older than I look, maybe to my benefit or detriment, I don't know. And I think about how engineering was approached when I first started working, and it's radically different now.
I remember I started as a junior engineer, and you had a senior engineer and a program manager, and part of the senior engineer's job was to be the tech lead, but also to mentor the junior engineers. That was part of their job; it was understood to be part of the job. That's why you have senior engineers and junior engineers on the team: so that the junior engineers get experience and learn. That is something I haven't seen for well over a decade now, this structure where we recognize that people need to learn and grow and be exposed to new topics.
Instead, we've chosen to value people that are the absolute expert in their very, very narrow niche, and then we expect them to be able to do all these other tasks outside of that, because, well, they're the best at this. Yeah, they're the best at that very, very narrow niche. That doesn't mean they're going to be able to figure out all the stuff that their counterpart, who's been doing it and only it for 10 years, knows.
So how can we possibly expect that? If I were an expert on a single graph query algorithm in Cypher, and all of a sudden you asked me to design a SQL database with X amount of reliability, you'd go, well, he's an expert in knowledge database systems. Well, no, I spent 10 years on this very, very narrow algorithm for this one system. I'm probably not going to be successful in applying that to something else when it takes somebody else doing it for 10 years to know it.
And that's a big challenge I see in the tech world right now, which even corresponds to what we're doing at an information level. It's the same thing. You're taking information in small chunks, whether it's social media or now vector embeddings, and you're removing it from its surroundings. You're losing all that context: the context of how it was initially connected within these systems, the relationships, the underlying relationships, how those relationships are all connected. And you need all of those connections to see the big picture.
Archaeologists talk about this a lot, and that's a term I use a lot: in situ context. Archaeologists talk about it when they do archaeological digs. That's why they're so careful, why they have these little brushes, actual paint brushes, to brush away the dirt: they're so terrified of removing an object from its context, because that's all the information you have. Once you find an artifact, take it out of the ground, and walk away with it, you've lost that context. Now you can't say, well, it was found in this layer with these other artifacts, so they probably come from the same time period or the same event, or how they are related, because now you're holding it in your hand and you've walked over here. Thinking about time again, you can't unwind that. You've removed its in situ context, and that's really applicable here in the AI space.
If you're just doing vector RAG, that's the same thing. You're taking these words or phrases and removing them from their in situ context, and once you get them out of that, they take on different meanings. And if you don't have a way of reconnecting those relationships from the original context, that's how you get, I don't want to say misinformation, but you can get inaccurate statements, statements that don't quite make sense, missing details. And all of that is really difficult to spot unless you read every single statement at the word level with a lot of understanding of what you're looking at. That's one of the articles I'm working on now: comparing these results and saying that, on the surface, they all look good to somebody that's not trained on the topic, but there are different levels of quality, for lack of a better term, in the responses from the different techniques.
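To make the in situ context point concrete, here is a minimal sketch, my addition rather than Daniel's or TrustGraph's implementation, of keeping provenance and neighbor links with each chunk so relationships can be reconnected at retrieval time; all names are illustrative.

```python
# A minimal sketch (my addition, not TrustGraph's implementation) of keeping
# "in situ context" with each chunk: where it came from, when it was ingested,
# and which chunks sit next to it, so relationships can be reconnected at
# retrieval time instead of being lost at embedding time.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Chunk:
    chunk_id: str
    text: str
    source_doc: str                      # document the chunk was cut from
    position: int                        # order within the source document
    ingested_at: datetime                # when it entered the system
    neighbors: list[str] = field(default_factory=list)  # adjacent chunk ids


def chunk_document(doc_id: str, text: str, size: int = 500) -> list[Chunk]:
    """Split a document but keep provenance and neighbor links per chunk."""
    now = datetime.now(timezone.utc)
    pieces = [text[i:i + size] for i in range(0, len(text), size)]
    chunks = [
        Chunk(f"{doc_id}:{i}", piece, doc_id, i, now)
        for i, piece in enumerate(pieces)
    ]
    for i, chunk in enumerate(chunks):
        if i > 0:
            chunk.neighbors.append(chunks[i - 1].chunk_id)
        if i < len(chunks) - 1:
            chunk.neighbors.append(chunks[i + 1].chunk_id)
    return chunks


def expand_hit(hit: Chunk, index: dict[str, Chunk]) -> str:
    """At retrieval time, stitch a hit back together with its neighbors."""
    parts = [index[n].text for n in hit.neighbors if n in index]
    return " ".join([hit.text, *parts])
```

The point of the sketch is only that a retrieved chunk can be stitched back into its surroundings instead of being interpreted in isolation.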
Nicolay Gerold: Yeah, and I think my biggest obsession in the space is actually learning patterns, because AI, especially as a field, steals a lot from different areas: from data engineering, from platform engineering, from systems engineering as well. It's like a grab bag, you put it all together, and I think when you actually learn the patterns behind it, you can make way better decisions about how different systems should be put together. And I think that's also partly why I want to do the podcast, because in one and a half hours of conversation, I think you learn a lot from a person.
But I
Daniel Davis: I understand where that's coming from, because I see so much of what I call reinventing the wheel. I see so many people, for instance, building RAG systems when there are so many solutions out there. There are many open source solutions and just as many products. And I go, why? Why are you building it again? Why are you rebuilding it?
And what I learned is that it kind of goes back to what we were talking about a little bit ago: software engineers' jobs are so narrow now that they're doing these projects just to learn. They're doing them because they're so bored in their jobs; they don't get to touch any of these other systems. And a lot of people are recognizing that. They're going, wow, I'm only working on this one really narrow topic every single day, and I want to try some of these other things out.
Then, though, there are some interesting effects of this: now we see all of this code out there that's kind of just a playground. People are playing with ideas, they're experimenting.
And from an AI perspective, we know that at least up until now, some people still think in terms of scaling: pre-training on more and more data is going to solve all of our problems. So they could be looking at this going, well, this is great, all this new code is great. I don't think so. I think this is the garbage-in, garbage-out situation, because now we have more code than ever before, because so many people are doing these experimental projects and it's easier than ever before to do it.
At one time, you would have to do a lot of research on how to use all these tools, and you'd have to manually write the code. Whereas now you're not even having to go to Stack Overflow to copy and paste. You can get an LLM to generate a lot of the code for you, and it's working code. Now, it may not work the way you want it to, but it's working code, so we see more code than ever before being pushed out to somewhere like GitHub.
But if you're trying to train a model to be able to write code, what coding practices do you want to reward? What are the best coding practices? What are the best algorithms? How do you even measure that anymore? We're going to come back to some software engineer having to look at this stuff and make a determination that A is better than B is better than C. And then you go, well, how did you make that determination? A lot of it can come back to personal opinion or personal philosophy. So it's a real challenge, yeah.
Nicolay Gerold: Yeah, but that's actually also an upside of generative AI. Because I can now code much faster, I can generate four different options to solve the same problem, write the different algorithms, and test them.
Daniel Davis: That's true. I don't see many people do that, though. My problem is, I don't think a lot of people test their code. That's my observation. Again, that's an observation, and over time that observation may not have enough supporting evidence. Actually, that would be an assertion. Excuse me, that's an assertion: my assertion based on my observations, which I can't support right now. I don't see a lot of people doing that rigor.
I don't see people really testing, because it's a lot of work to do all that code coverage and all that testing, and it's not fun. But then you get into: are these the most effective ways to do this? So I think you're right. There are people like yourself that see this as an opportunity: wow, now I can quickly do this three different ways and see which way I like best. But I think the more common approach is: the LLM gave me code, I copied and pasted it, I'm on to the next thing.
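As a concrete version of the rigor discussed here, this is a minimal sketch, my addition with hypothetical candidate functions, of how several generated implementations of the same problem could be checked for correctness and rough performance before picking one.

```python
# A minimal sketch (my addition, hypothetical candidates) of the rigor Nicolay
# describes: generate several implementations of the same problem, then check
# correctness and rough performance before picking one.
import timeit


def dedupe_sorted(items):          # candidate A: sort, then scan for repeats
    out, last = [], object()
    for x in sorted(items):
        if x != last:
            out.append(x)
            last = x
    return out


def dedupe_set(items):             # candidate B: set-based, re-sorted
    return sorted(set(items))


CANDIDATES = [dedupe_sorted, dedupe_set]
CASES = [([3, 1, 2, 3, 1], [1, 2, 3]), ([], []), ([5], [5])]

for fn in CANDIDATES:
    ok = all(fn(list(inp)) == expected for inp, expected in CASES)
    secs = timeit.timeit(lambda: fn(list(range(1000)) * 3), number=200)
    print(f"{fn.__name__}: correct={ok}, 200 runs={secs:.3f}s")
```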
Nicolay Gerold: Yeah, I think we could drag this conversation on for another hour, but we have to go to the outro, otherwise I'll take too much of your time. What is next? What can you already tease? What's on the horizon for TrustGraph? What are you guys building?
Daniel Davis: Well, the thing that we're really focused on right now is that our users have asked us about better ways of managing data. People want to be able to control how they load data into the system, granularly select what kinds of topics and data they have loaded so they can unload, reload, and do it dynamically, and to have this very reliable, very modular infrastructure running at all times while controlling what's deployed in it. So they want more granular tools for that, basically infrastructure tools.
And then, though, we want to really dig into what we're calling temporal RAG, which is going to be a big effort. I kind of hinted at how we want to approach it: facts, observations, and assertions. I think we're going to start by trying to create those definitions and keep them
pretty simple at the beginning.
My inclination is to try to keep those simple going forward. There can be pretty sophisticated analytics algorithms underpinning how we calculate things like trust or freshness, but I think we have to try to simplify these kinds of things so that there can be some sort of consensus.
Otherwise, I don't think it's going to drive value. Like I said, I've seen a lot of people chasing the white whales of trust indices or trust metrics, or even risk: the more sophisticated and complex you make it, the less buy-in you get, because people can point to exceptions to every rule. So it's really about starting with a solid foundation of how we want to go through this.
And it is going to require a Bayesian approach because, like I said, there are the before times and the after times. Once you get data into the system, you can start thinking about time differently than for what happened in the past. That's something I think we're just going to have to accept: we can't necessarily understand past information with the level of granularity we will be able to going forward.
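A minimal sketch of what such simple scoring could look like, under my own assumptions rather than TrustGraph's actual design: statements tagged as facts, observations, or assertions carry timestamps, confidence decays with age, and a Bayesian-style update adjusts confidence as new evidence arrives after ingestion.

```python
# A minimal sketch (my assumptions, not TrustGraph's design) of consensus-
# friendly scoring: statements are tagged as fact, observation, or assertion,
# carry timestamps, and get a freshness-decayed confidence plus a Bayesian-
# style update when supporting or contradicting evidence arrives later.
import math
from dataclasses import dataclass
from datetime import datetime, timezone

HALF_LIFE_DAYS = {"fact": math.inf, "observation": 90.0, "assertion": 30.0}


@dataclass
class Statement:
    text: str
    kind: str                 # "fact" | "observation" | "assertion"
    recorded_at: datetime
    confidence: float         # prior belief that the statement still holds


def freshness(stmt: Statement, now: datetime) -> float:
    """Exponential decay by age; facts don't decay."""
    half_life = HALF_LIFE_DAYS[stmt.kind]
    if math.isinf(half_life):
        return 1.0
    age_days = (now - stmt.recorded_at).total_seconds() / 86400
    return 0.5 ** (age_days / half_life)


def update_confidence(prior: float, evidence_supports: bool,
                      likelihood_ratio: float = 3.0) -> float:
    """Bayesian-style update in odds form when new evidence arrives."""
    odds = prior / (1 - prior)
    odds *= likelihood_ratio if evidence_supports else 1 / likelihood_ratio
    return odds / (1 + odds)


now = datetime.now(timezone.utc)
s = Statement("Most people don't test their code", "assertion", now, 0.6)
print(freshness(s, now), update_confidence(s.confidence, evidence_supports=True))
```

The half-lives and likelihood ratio here are arbitrary placeholders; the point is only that the mechanics can stay simple enough for everyone to agree on.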
Nicolay Gerold: Nice. And if people want to start building the stuff we just talked about, or want to build on top of TrustGraph, where would you point them?
Daniel Davis: Well, I don't know if this is backwards or not: trustgraph.ai. That'll point you to our GitHub, and that's where we primarily do everything, in GitHub.
We have our Discord, and we actually have our own YouTube channels where we do tutorial videos, and I talk to other people in the space as well. But yeah, Discord is always a good way to get in contact with us and to talk with other people about some of the things they're working on. We just recently did a poll, and that's why those are the things people were interested in. There were a couple of things I really thought people would be interested in, and so far we haven't seen it. Boy, I've been thinking about taking a multimodal approach to this for a while now, and so far that hasn't been what people are looking for. I'm sure there's somebody out there going, multimodal agentic RAG, yes, I'm interested, but that's not what our users want; nobody voted for it. That surprised me, because it was one of my suggestions.
Nicolay Gerold: So what can we take away if you want to apply this in production?
I think, first of all, building specialized tools rather than a one-size-fits-all solution is a very interesting approach. I think this has been a little bit lost in the ML, AI, and data space as well: in the beginning, tools focus on one specific problem and solve it masterfully, but then they go broader and broader.
And that's okay for a bunch of domains, but I think in a lot of cases it would probably be better if you stick with that area, actually try to master it, and become the default tool, the de facto standard, because you're so good at it. I think this has been a little bit lost: having single-purpose tools, domain-specific applications which focus on one specific problem and the core functionalities, develop clear interfaces for that, optimize the performance, stay a domain-specific solution, and maintain their independence.
With a lot of other tooling and software solutions, a single tool gets overloaded and generalistic. They become like Excel, where in a lot of cases you don't even know what to do in it anymore, it forces a lot of generic data use on top of you, and the integration becomes overcomplicated. So instead of trying to build an all-in-one solution, I think you should think about building specialized tools.
I think this framing is also really interesting for software startups: where does it actually pay to invest in specialized tooling as opposed to adopting something more general? These tech decisions have become more and more complex nowadays, because we have more and more open source solutions out there which are very good for a certain set of problems. The question is whether they align with your specific problem at hand.
As engineers, we often opt into building something in-house, something specialized, but this comes with a lot of cost and complexity down the road: whatever specialized thing you build for the moment also has to be maintained over time, adjusted, and extended with features as the use case evolves. I think this is often ignored, and these are tasks that are often not directly related to the specific problem you're solving for the business or the end user.
Specialized tools and specialized software have a place, but you should be really clear about what their purpose is, why you are building them, and why you can't use something that's already been built by other engineers. Only if the reason is, okay, no one has done it before, or we need something so different to actually solve the problem at hand, do you think about building something in-house.
I also love Daniel's point on robust infrastructure, though I'm always a little bit torn on that. They have built a lot on Apache Pulsar and Cassandra, both very scalable technologies, and I also talked to Daniel's co-founder, who apparently recycled a lot from a previous project. So take it with a grain of salt whether this infrastructure is the right one for you. Yes, it's scalable, but at the same time, they had built with these technical components before.
So I think the best tech stack to build with is probably the one you have built with before, because you already have the experience: you know how to set it up, how to put it together, what the downsides and performance characteristics are. And especially at the early stages, you don't want to spend your time thinking about the technical components and how they work; you want to spend it thinking about the problem and how to solve it. When you go with solutions or technologies you haven't used before, you will spend a lot of time figuring them out and mastering them.
So this is an extra caveat. And then, once you actually hit performance issues, you can start to think about doing a migration and switching to a more powerful tool. At that point, you're likely able to hire more help, or hopefully you have a bit more time on your hands to actually build a better system.
When we look at knowledge graphs, I really like the favor-simplicity advice: keep it simple and keep it modular. Rather than relying on really complex and nested ontologies, like for example schema.org, which has evolved over years, try first to build something simple: design a graph with a few core entities and a few core relationships that capture the essence of the domain, and then add more and more to it over time.
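As a rough illustration of that advice, here is a minimal sketch, with illustrative entity and relationship names of my choosing, of a graph schema that starts with a handful of core types and can be extended later.

```python
# A minimal sketch (illustrative names, my addition) of starting a knowledge
# graph with a handful of core entity and relationship types instead of a
# deep ontology. New types can be registered later without reshaping the data.
from dataclasses import dataclass, field

ENTITY_TYPES = {"Person", "Organization", "Document"}
RELATION_TYPES = {"works_for", "authored", "mentions"}


@dataclass
class Node:
    node_id: str
    etype: str
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rtype: str


class Graph:
    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: list[Edge] = []

    def add_node(self, node: Node):
        assert node.etype in ENTITY_TYPES, f"unknown entity type {node.etype}"
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge):
        assert edge.rtype in RELATION_TYPES, f"unknown relation {edge.rtype}"
        self.edges.append(edge)


g = Graph()
g.add_node(Node("p1", "Person", {"name": "Daniel Davis"}))
g.add_node(Node("o1", "Organization", {"name": "TrustGraph"}))
g.add_edge(Edge("p1", "o1", "works_for"))
```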
You might also end up at something similar to schema.org; a lot of people have spent a lot of effort figuring out that schema. But you want to build something that works for you, fast, and in a way that you can wrap your head around.
And what I always like to say: a complex system that works has likely evolved from a simple system that worked before. So start with a simple system, and add or integrate complexity over time when it's necessary, not just for the sake of complexity.
So when you look at a new domain, define the core entities and design for flexibility, so keep it extensible. This is also about maintaining optionality when you're building stuff. When I'm thinking about how to structure a project and what to implement when, it's always interesting to ask: how can I set it up so that it's easy to extend, easy to add new functionality or features later? And when you manage to do that, it becomes easy to start with something more simplistic, because you know that when you have to make it more complex, you can do that down the road.
And yep, I think that's it. Let me know what you think of it; I would love to hear your comments and reviews, negative or positive. Also, if you liked it and you've listened for eight minutes now to hear me babble in this post review, leave a review on Spotify or Apple Podcasts or wherever you're listening, and leave a like on YouTube. It helps a lot. And recommend it to your friends, your colleagues, and your students; that's the best way to make it grow. Otherwise, I will catch you with another episode on knowledge graphs next week. See you then.