S2E17

#034 Rethinking Search Inside Postgres, From Lexemes to BM25

December 5, 2024 · 47:16

Many companies run Elastic or OpenSearch but use only 10% of the capacity.

They have to build ETL pipelines.

Get data normalized.

Worry about race conditions.

All in all, at the moment, when you want to do search on top of your transactional data, you are forced to build a distributed system.

Not anymore.

ParadeDB is building an open-source PostgreSQL extension to enable search within your database.

Today, I am talking to Philippe Noël, the founder and CEO of ParadeDB.

We talk about how they built it, how they integrate with the Postgres query engine, and how you can build search on top of Postgres.

Key Insights:

Search is changing. We're moving from separate search clusters to search inside databases. Simpler architecture, stronger guarantees, lower costs up to a certain scale.

Most search engines force you to duplicate data. ParadeDB doesn't. You keep data normalized and join at query time. It hooks deep into Postgres's query planner. It doesn't just bolt on search - it lets Postgres optimize search queries alongside SQL ones.

Search indices can work with ACID. ParadeDB's BM25 index keeps Lucene-style components (term frequency, normalization) but adds Postgres metadata for transactions. Search + ACID is possible.
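To make the "term frequency, normalization" part concrete, here is a minimal Python sketch of the standard BM25 scoring formula. It is a toy illustration, not ParadeDB's implementation: the function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example, and nothing here models the Postgres transaction metadata.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with textbook BM25.

    Hypothetical helper for illustration only: tokenization is a plain
    lowercase whitespace split, and k1/b are commonly used defaults.
    """
    if not docs:
        return []

    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: long documents are penalized via b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "postgres full text search",
        "postgres transactions and acid guarantees",
    ]
    print(bm25_scores("search", docs))
```

A document that does not contain any query term scores zero, and of two documents with the same term frequency the shorter one scores higher; that is the normalization component at work.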

Two storage types matter: inverted indices for text, columnar "fast fields" for analytics. Pick the right one or queries get slow. Integers now default to columnar to prevent common mistakes.
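The difference between the two storage types can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not ParadeDB's on-disk format: the helper names `build_inverted_index`, `build_column`, and `search_with_filter`, and the sample rows, are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(rows, field):
    """Map each term to the set of row ids containing it (good for text lookup)."""
    index = defaultdict(set)
    for row in rows:
        for term in row[field].lower().split():
            index[term].add(row["id"])
    return index


def build_column(rows, field):
    """Store one field's values keyed by row id (a stand-in for a columnar fast field)."""
    return {row["id"]: row[field] for row in rows}


def search_with_filter(index, column, term, min_value):
    """Find rows matching `term`, then filter on the numeric column."""
    matches = index.get(term.lower(), set())
    return sorted(row_id for row_id in matches if column[row_id] >= min_value)


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    rows = [
        {"id": 1, "description": "wireless keyboard", "rating": 4},
        {"id": 2, "description": "wired keyboard", "rating": 2},
        {"id": 3, "description": "wireless mouse", "rating": 5},
    ]
    text_index = build_inverted_index(rows, "description")
    rating_column = build_column(rows, "rating")
    print(search_with_filter(text_index, rating_column, "wireless", 3))
```

The text lookup goes through the inverted index, while the numeric filter reads values straight from the column. Putting the integer field into the inverted index instead would turn a range filter into many term lookups, which is the kind of mistake the columnar default for integers is meant to prevent.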

Mixing query engines looks tempting but fails. The team tried using DuckDB and DataFusion inside Postgres. Both were fast but broke ACID compliance. They had to rebuild features natively.

Philippe Noël:

Nicolay Gerold:

00:00 Introduction to ParadeDB
00:53 Building ParadeDB with Rust
01:43 Integrating Search in Postgres
03:04 ParadeDB vs. Elastic
05:48 Technical Deep Dive: Postgres Integration
07:27 Challenges and Solutions
09:35 Transactional Safety and Performance
11:06 Composable Data Systems
15:26 Columnar Storage and Analytics
20:54 Case Study: Alibaba Cloud
21:57 Data Warehouse Context
23:24 Custom Indexing with BM25
24:01 Postgres Indexing Overview
24:17 Fast Fields and Columnar Format
24:52 Lucene Inspiration and Data Storage
26:06 Setting Up and Managing Indexes
27:43 Query Building and Complex Searches
30:21 Scaling and Sharding Strategies
35:27 Query Optimization and Common Mistakes
38:39 Future Developments and Integrations
39:24 Building a Full-Fledged Search Application
42:53 Challenges and Advantages of Using ParadeDB
46:43 Final Thoughts and Recommendations


Listen to How AI Is Built using one of many popular podcasting apps or directories.
