Today’s guest is Mór Kapronczay, Head of ML at Superlinked. Superlinked is a compute framework for your information retrieval and feature engineering systems that turns anything into embeddings.
When most people think about embeddings, they think about OpenAI's ada model.
You just take your text and throw it in there.
But that’s too crude.
OpenAI embeddings are trained on the internet.
But your data set (most likely) is not the internet.
You have different nuances.
And you have more than just text.
So why not use it?
Some highlights:
- Text Embeddings are Not a Magic Bullet
➡️ Pouring everything into a text embedding model won't yield magical results
➡️ Language is lossy - it's a poor compression method for complex information
- Embedding Numerical Data
➡️ Direct number embeddings don't work well for vector search
➡️ Consider projecting number ranges onto a quarter circle
➡️ Apply logarithmic transforms for skewed distributions
- Multi-Modal Embeddings
➡️ Create separate vector parts for different data aspects
➡️ Normalize individual parts
➡️ Weight vector parts based on importance
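To make the numerical-data highlight concrete, here is a minimal sketch of the quarter-circle idea in plain Python. It is an illustration under stated assumptions, not Superlinked's implementation: the function names (`embed_number`, `dot`), the clamping of out-of-range values, and the use of `log1p` for the logarithmic transform are all choices made for this example.

```python
import math


def embed_number(value, min_value, max_value, log_scale=False):
    """Map a number in [min_value, max_value] to a 2-D unit vector on a quarter circle.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    # Clamp so out-of-range inputs land on the ends of the arc.
    clamped = min(max(value, min_value), max_value)
    if log_scale:
        if min_value < 0:
            raise ValueError("log_scale assumes a non-negative range")
        # log1p compresses long-tailed ranges so small values are spread out.
        low, high = math.log1p(min_value), math.log1p(max_value)
        position = (math.log1p(clamped) - low) / (high - low)
    else:
        position = (clamped - min_value) / (max_value - min_value)
    # position 0 maps to angle 0, position 1 maps to 90 degrees.
    angle = position * math.pi / 2
    return (math.cos(angle), math.sin(angle))


def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    cheap, mid, pricey = (embed_number(p, 0, 100) for p in (10, 20, 90))
    print("10 vs 20:", dot(cheap, mid))
    print("10 vs 90:", dot(cheap, pricey))
```

Because every number lands on the same arc, the similarity between two embedded numbers shrinks steadily as the numbers move apart, and it never turns negative, since the arc spans only 90 degrees.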
A multi-vector approach helps you understand how much each modality or embedding contributes to the result. It also makes your retrieval system easier to tune: instead of fine-tuning your embedding models, you tune your vector database the way you would tune a search database (like Elastic).
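A rough sketch of the "separate parts, normalize, weight" recipe, again in plain Python. Everything here is assumed for illustration (the `combine_parts` and `score` helpers, the part names `text` and `price`, and the choice to apply weights on the query side only); it is not the Superlinked API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def combine_parts(parts, weights):
    """Normalize each named part, scale it by its weight, and concatenate.

    parts: dict of part name -> vector; weights: dict of part name -> float.
    Parts missing from `weights` default to 1.0. Parts are concatenated in
    sorted name order so queries and items line up.
    """
    combined = []
    for name in sorted(parts):
        weight = weights.get(name, 1.0)
        combined.extend(weight * x for x in l2_normalize(parts[name]))
    return combined


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(query_parts, item_parts, query_weights):
    """Items are stored unweighted; weights are applied to the query only."""
    query_vec = combine_parts(query_parts, query_weights)
    item_vec = combine_parts(item_parts, {})
    return dot(query_vec, item_vec)


if __name__ == "__main__":
    query = {"text": [1.0, 0.0], "price": [1.0, 0.0]}
    item_a = {"text": [2.0, 0.0], "price": [0.0, 3.0]}  # matches on text
    item_b = {"text": [0.0, 5.0], "price": [4.0, 0.0]}  # matches on price
    for weights in ({"text": 1.0, "price": 0.1}, {"text": 0.1, "price": 1.0}):
        print(weights, score(query, item_a, weights), score(query, item_b, weights))
```

With unit-length parts, the dot product of the concatenated vectors is the weighted sum of the per-part cosine similarities. That is why changing the query weights re-ranks results without re-embedding or retraining anything, and why each part's contribution to the score can be read off directly.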
Mór Kapronczay
Nicolay Gerold:
00:00 Introduction to Embeddings
00:30 Beyond Text: Expanding Embedding Capabilities
02:09 Challenges and Innovations in Embedding Techniques
03:49 Unified Representations and Vector Computers
05:54 Embedding Complex Data Types
07:21 Recommender Systems and Interaction Data
08:59 Combining and Weighing Embeddings
14:58 Handling Numerical and Categorical Data
20:35 Optimizing Embedding Efficiency
22:46 Dynamic Weighting and Evaluation
24:35 Exploring A/B Testing with Embeddings
25:08 Joint vs Separate Embedding Spaces
27:30 Understanding Embedding Dimensions
29:59 Libraries and Frameworks for Embeddings
32:08 Challenges in Embedding Models
33:03 Vector Database Connectors
34:09 Balancing Production and Updates
36:50 Future of Vector Search and Modalities
39:36 Building with Embeddings: Tips and Tricks
42:26 Concluding Thoughts and Next Steps